Why is matmul slower with gfortran compiler optimization turned on?

If I use gfortran (Homebrew GCC 8.2.0) on my Mac to compile the simple program below without optimization (-O0) the call to matmul consistently executes in ~90 milliseconds. If I use any optimization (flags -O1, -O2 or -O3) the execution time increases to ~250 milliseconds. I've tried using a wide range of different sizes for inVect and matrix but in all cases the -O0 option outperforms the other three optimization flags by at least a factor of 2.5. If I use smaller matrices with just a few hundred elements but loop over many calls to matmul the performance hit is even worse, close to a factor of 10.
Is there a way I can avoid this behavior? I need to use optimization in some portions of my code but, at the same time, I also would like to perform the matrix multiplication as efficiently as possible.
I compile the file sandbox.f90 containing the code below with the command gfortran -ON sandbox.f90, where N is an optimization level 0-3 (no other compiler flags are used). The first value of outVect is printed solely to keep the gfortran optimization from being clever and skipping the call to matmul altogether.
I'm a Fortran novice, so I apologize in advance if I am missing something obvious here.
program main
  implicit none
  real :: inVect(20000), matrix(20000,10000), outVect(10000)
  real :: start, finish
  call random_number(inVect)
  call random_number(matrix)
  call cpu_time(start)
  outVect = matmul(inVect, matrix)
  call cpu_time(finish)
  print '("Time = ",f10.7," seconds. – First Value = ",f10.4)', finish-start, outVect(1)
end program main

First, consider that I may be wrong. I just saw this problem for the first time, and I'm as surprised as you.
I just studied this problem and I understand it as follows. The optimization levels (-O1, -O2, -O3, -Ofast, and so on) are tuned for the most general (most frequent) cases. However, in some cases an optimization level introduces a drawback, because each level implicitly enables a set of flags, and some of those flags can increase the execution time of a specific task. In your case, -O3 causes (among other things) every call to matmul() to be inlined. That is generally a good thing, but not necessarily for large arrays or for many repeated calls to this function. Somehow, the cost of inlining matmul() outweighs the gain normally obtained from inlining (at least, this is how I see it).
To avoid this behavior, I suggest using -O3 -finline-matmul-limit=0, which disables the inlining of matmul() (see the compile command below). With -O3 -finline-matmul-limit=0 the execution time is no worse than what is obtained with -O0.
More generally, you can use -finline-matmul-limit=n to inline matmul() only when the arrays involved have at most n elements. I use n=0 for simplicity.
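For example, reusing the file name sandbox.f90 and the compile command from the question (the exact -O level is up to you):
gfortran -O3 -finline-matmul-limit=0 sandbox.f90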
I hope this helps you.

Related

What is actually meant by parallel_iterations in tfp.mcmc.sample_chain?

I am not able to figure out what the parameter parallel_iterations stands for when sampling multiple chains during MCMC.
The documentation for mcmc.sample_chain() doesn't give much detail; it just says that
The parallel iterations are the number of iterations allowed to run in parallel. It must be a positive integer.
I am running a NUTS sampler with multiple chains while specifying parallel_iterations=8.
Does it mean that the chains are strictly run in parallel? Is the parallel execution dependent on multi-core support? If so, what is a good value (based on the number of cores) to set parallel_iterations? Should I naively set it to some higher value?
TensorFlow can unroll iterations of while loops to execute in parallel when some parts of the data flow (e.g. the iteration condition) can be computed faster than other parts. If you don't have a special preference (such as reproducibility with legacy stateful samplers), leave it at the default.
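As a hedged illustration (the kernel, toy target function, and shapes below are made up for the example, not taken from the question): parallel_iterations is simply forwarded to the underlying tf.while_loop, while the number of chains is controlled by batching the state.
import tensorflow as tf
import tensorflow_probability as tfp

kernel = tfp.mcmc.NoUTurnSampler(
    target_log_prob_fn=lambda x: -tf.reduce_sum(x**2, axis=-1),  # toy target
    step_size=0.1)

samples = tfp.mcmc.sample_chain(
    num_results=500,
    current_state=tf.zeros([8, 2]),   # 8 chains come from the batched state, not from parallel_iterations
    kernel=kernel,
    num_burnin_steps=100,
    trace_fn=None,
    parallel_iterations=8)            # forwarded to tf.while_loop; the default is 10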

Causes of floating point non-determinism? Including NumPy?

IEEE floating point operations are deterministic, but see How can floating point calculations be made deterministic? for one way that an overall floating point computation can be non-deterministic:
... parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
Two-part question:
How else can an overall floating point computation be non-deterministic, yielding results that are not exactly equal?
Consider a single-threaded Python program that calls NumPy, CVXOPT, and SciPy subroutines such as scipy.optimize.fsolve(), which in turn call native libraries like MINPACK and GLPK and optimized linear algebra subroutines like BLAS, ATLAS, and MKL. “If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.”
Do these native libraries ever parallelize in a way that introduces non-deterministic results?
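(An aside added for clarity, not part of the original question: you can at least check which BLAS/LAPACK implementation your NumPy build links against, since that determines whether threaded kernels are in play.)
import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries this NumPy build was compiled against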
Assumptions:
The same software, with the same inputs, on the same hardware. The output of multiple runs should be equal.
If that works, it's highly desirable to test that the output after doing a code refactoring is equal. (Yes, some changes in order of operations can make some of the output not-equal.)
All random numbers in the program are pseudo-random numbers used in a consistent way from the same seeds across all runs.
No uninitialized values. Python is generally safe in that way but numpy.empty() returns a new array without initializing entries. And it's not clear that it's much faster in practice. So beware!
@PaulPanzer's test shows that numpy.empty() does return an uninitialized array, and that it can easily and quickly recycle the memory of a recently freed array (run these lines in an interactive session to see the leftover values):
import numpy as np
np.arange(100); np.empty(100, int); np.empty(100, int)            # the empty() arrays often still contain 0..99 left over from the arange
np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)  # likewise, often 100.0..199.0
It's tricky to get useful timing measurements for these routines! In a timeit loop, numpy.empty() can just keep re-allocating the same one or two memory blocks, so the measured time is independent of the array size. To prevent that recycling:
from timeit import timeit
timeit('l.append(numpy.empty(100000))', 'import numpy; l = []')   # appending keeps every array alive, so memory can't be recycled
timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
but reducing that array size to numpy.zeros(10000) takes 15x as long; reducing it to numpy.zeros(1000) takes 1.3x as long (on my MBP). Puzzling.
See also:
Hash values are salted in Python 3 (and optionally in Python 2 via PYTHONHASHSEED), and iteration order over sets, and over dicts in versions before 3.7, depends on those hashes, so the order of operations could vary from run to run. (Python 3.7+ dicts preserve insertion order.) A small illustration follows at the end of this question. [I'm wrangling with this problem in Python 2.7.15.]
I found that most (not all) of the non-determinism problems I'm experiencing seem to be fixed in the code for OpenBLAS 0.3.5.
A bunch of threading problems in earlier versions of OpenBLAS are fixed in release 0.3.4, but that release has a macOS compatibility bug that's fixed in the code for release 0.3.5. The bugs also occur with Apple's Accelerate framework version 1.1 and Intel's MKL mkl==2019.0.
See how to install OpenBLAS and compile NumPy and SciPy on it.
Perhaps the remaining problems I'm experiencing are due to other libraries linked to Accelerate?
Note: I'm still open to more answers to this question.
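To make the hash-salting point above concrete, here is a hedged illustration (toy names and values, not from my actual code): when hash randomization is active, iteration order over a set of strings can differ between runs, which changes the floating point summation order and can therefore change the last bits of the result.
weights = {"alpha": 0.1, "beta": 0.2, "gamma": 0.3}
total = 0.0
for name in set(weights):   # iteration order over the set can differ across runs when hashes are salted
    total += weights[name]
print(total)                # the last bits may differ depending on the summation order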

OpenCL: Type conversion overhead

What is the cost of casting a variable to a different type in OpenCL?
Example: I want to take the dot product of 2 int3 vectors (AFAIK dot() isn't overloaded for int3s), so instead of implementing dot() myself in an unvectorized way, I want to vectorize the code by using the native dot() for float3. First I convert the 2 vectors to float3s and then I cast the result to int.
Which of the two functions, foo and bar, is less time consuming (and why)?
inline int foo(int3 a, int3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

inline int bar(int3 a, int3 b) {
    return (int)dot(convert_float3(a), convert_float3(b));
}
As has been suggested in the comments, measuring is going to be the most useful tool in practice, and the cost of individual instructions is heavily dependent on hardware architecture, but also the compiler.
Nevertheless, a comparison to other operations is useful, and at least AMD publishes a list of the instruction throughput for their devices in this section of their OpenCL optimisation guide, and this includes float-to-int and int-to-float conversion.
In your particular case, I strongly suspect your "vectorising" attempts will have detrimental effects. Most modern GPUs aren't SIMD processors in the CPU SIMD sense. The threads run in lock-step, but each thread operates on scalars. A "horizontal" operation like a dot product may not be particularly efficient even if the GPU does use per-thread SIMD.
If you can limit the range of each of your integers to 24 bits, a series of mad24() and mul24() calls will most likely be fastest. But again - measure. Try the different options on a range of hardware, and run them lots of times, applying basic stats to make sure you aren't just seeing random variation/overhead.
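For reference, a hedged sketch of that 24-bit approach (the name dot24 is made up here, and it assumes every component of both vectors fits in 24 signed bits; otherwise the result is undefined):
inline int dot24(int3 a, int3 b) {
    // mad24(x, y, z) computes x*y + z using the fast 24-bit multiplier
    return mad24(a.x, b.x, mad24(a.y, b.y, mul24(a.z, b.z)));
}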
A separate thing to note with regard to integer-to-float conversions is that such conversions are often "free" when you sample as floats from an image object containing integers.

Is there a GPU accelerated numpy.max(X, axis=0) implementation in Theano?

Do we have a GPU accelerated version of numpy.max(X, axis=None) in Theano?
I looked into the documentation and found theano.tensor.max(X, axis=None), but it is 4-5 times slower than the numpy implementation.
I can assure you, it is not slow because of some bad choice of matrix size. Same matrix under theano.tensor.exp is 40 times faster than its numpy counterpart.
Any suggestions?
The previous answer is only partial, and its suggestion should not help: the workaround it proposes is exactly what ends up in the final compiled code anyway, because there is an optimization that applies this transformation automatically.
The title of the question isn't the same as the content. They differ by the axis argument. I'll answer both questions.
If the axis is 0 or None, we support this operation on the GPU for matrices. If the axis is None, we have a basic implementation that isn't well optimized, as it is harder to parallelize. If the axis is 0, we also have only a basic implementation, but it is faster, as it is easier to parallelize.
Also, how did you do your timing? If you just make one function with only that operation and test it via the device=gpu flag to do your comparison, this will include the transfer time between the CPU and the GPU. This is a memory-bound operation, so if you include the transfer in your timing, personally I don't expect any speed-up in that case. To see only the GPU operation, use the Theano profiler: run with the Theano flag profile=True.
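As a hedged sketch of that kind of measurement (the matrix size and repetition count below are arbitrary), you would build a function around just the max and let the profiler break out transfer versus compute time:
# run with: THEANO_FLAGS=device=gpu,profile=True python this_script.py
import numpy as np
import theano
import theano.tensor as T

X = T.matrix("X")
f = theano.function([X], T.max(X, axis=0))

A = np.random.rand(4096, 4096).astype(theano.config.floatX)
for _ in range(10):
    f(A)   # the profile printed at exit shows per-op GPU time separately from host<->device transfers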
The max and exp operations are fundamentally different; exp (and other operations like addition, sin, etc.) is an elementwise operation that is embarrassingly parallelizable, while max requires a parallel-processing scan algorithm that basically builds up a tree of pairwise comparisons over an array. It's not impossible to speed up max, but it's not as easy as exp.
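To make that concrete, here is a hedged, NumPy-only illustration of the pairwise-reduction idea (this is not Theano's actual implementation, just the shape of the algorithm):
import numpy as np

def tree_max(x):
    x = np.asarray(x).ravel().copy()
    while x.size > 1:
        if x.size % 2:                       # pad odd lengths by repeating the last element
            x = np.append(x, x[-1])
        x = np.maximum(x[0::2], x[1::2])     # one level of the tree: all pairs compared "in parallel"
    return x[0]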
Anyway, the theano implementation of max basically consists of the following lines (in theano/tensor/basic.py):
try:
    out = max_and_argmax(x, axis)[0]
except Exception:
    out = CAReduce(scal.maximum, axis)(x)
where max_and_argmax is a bunch of custom code that, to my eye, implements a max+argmax operation using numpy, and CAReduce is a generic GPU-accelerated scan operation used as a fallback (which, according to the comments, doesn't support grad etc.). You could try using the fallback directly and see whether that is faster, maybe something like this:
from theano.tensor.elemwise import CAReduce
from theano.scalar import maximum

def mymax(X, axis=None):
    return CAReduce(maximum, axis)(X)

Are there compilers that optimise floating point operations for accuracy (as opposed to speed)?

We know that compilers are getting better and better at optimising our code to make it run faster, but my question is: are there compilers that can optimise floating point operations to ensure greater accuracy?
For example, a basic rule is to perform multiplications before additions; the idea is that multiplication and division of floating point numbers do not introduce inaccuracies as large as those of addition and subtraction, but they can magnify inaccuracies already introduced by addition and subtraction, so in many cases the multiplications should be done first.
So a floating point operation like
y = x*(a + b); // faster but less accurate
Should be changed to
y = x*a + x*b; // slower but more accurate
Are there any compilers that will optimise for improved floating point accuracy at the expense of speed, like I showed above? Or is the main concern of compilers speed, without looking at the accuracy of floating point operations?
Thanks
Update: The selected answer shows a very good example where this type of optimisation would not work, so it wouldn't be possible for the compiler to know beforehand which is the more accurate way to evaluate y. Thanks for the counterexample.
Your premise is faulty. x*(a + b) is (in general) no less accurate than x*a + x*b. In fact, it will often be more accurate, because it performs only two floating point operations (and therefore incurs only two rounding errors), whereas the latter performs three operations.
If you know something about the expected distribution of values for x, a, and b a priori, then you could make an informed decision, but compilers almost never have access to that type of information.
That aside, what if the person writing the program actually meant x*(a+b) and specifically wanted the exact roundings that are caused by that particular sequence of operations? This sort of thing is actually pretty common in high-quality numerical algorithms.
Better to do what the programmer wrote, not what you think he might have intended.
Edit -- An example to illustrate a case where the transformation you suggested results in a catastrophic loss of accuracy: suppose
x = 3.1415926535897931
a = 1.0e15
b = -(1.0e15 - 1.0)
Then, evaluating in double we get:
x*(a + b) = 3.1415926535897931
but
x*a + x*b = 3.0
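If you want to check that example yourself, here is a quick sketch in Python (whose floats are IEEE 754 doubles):
x = 3.1415926535897931
a = 1.0e15
b = -(1.0e15 - 1.0)

print(x * (a + b))    # a + b is exactly 1.0, so this prints 3.141592653589793
print(x * a + x * b)  # each product is rounded at the 1e15 scale, and per the answer above the result is 3.0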
Compilers typically "optimize" for accuracy over speed, with accuracy defined as exact implementation of the IEEE 754 standard. Whereas integer operations can be reordered in any way that doesn't cause overflow, FP operations need to be performed exactly as the programmer specifies. This may sacrifice numerical accuracy (ordinary C compilers are not equipped to optimize for that), but it faithfully implements what the programmer asked for.
A programmer who is sure he hasn't manually optimized for accuracy may enable compiler features like GCC's -funsafe-math-optimizations and -ffinite-math-only to possibly extract extra speed. But usually there isn't much gain.
No, there isn't. Stephen Canon gives some good reasons why this would be a stupid idea, and he's correct; so you won't find a compiler that does this.
If you as the programmer have some knowledge about the ranges of numbers you're manipulating, you can use parentheses, temporary variables and similar constructs to strongly hint the compiler about how you want things done.