How to enable Loop tiling in gcc? - optimization

How do I compile code with gcc so that it performs loop tiling (blocking)? The -O3 optimization level does not do loop tiling by default. I need to enable loop tiling in addition to this flag and also find out the tile factor (e.g. cubic tiling or rectangular tiling), i.e. the internal tiling heuristics.
Thanks

You haven't provided the exact version of gcc, nor example code, nor the resulting code, nor did you look very hard on the internet, but possibly this already answers your question:
Strip mining is an optimization that was introduced into gcc with the merge of the graphite branch in version 4.4. See also the manual:
-floop-strip-mine
Perform loop strip mining transformations on loops. Strip mining splits a loop into two nested loops. The outer loop has strides equal to the strip size and the inner loop has strides of the original loop within a strip. The strip length can be changed using the loop-block-tile-size parameter. For example, given a loop like:
DO I = 1, N
  A(I) = A(I) + C
ENDDO
loop strip mining will transform the loop as if the user had written:
DO II = 1, N, 51
  DO I = II, min (II + 50, N)
    A(I) = A(I) + C
  ENDDO
ENDDO
This optimization applies to all the languages supported by GCC and is not limited to Fortran. To use this code transformation, GCC has to be configured with --with-ppl and --with-cloog to enable the Graphite loop transformation infrastructure.
You may run man gcc | grep '\-floop\-strip\-mine' to check if that is a supported option. For the exact gcc version, type gcc --version.
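To illustrate the intended effect, here is a minimal hand-tiled (blocked) matrix-multiply sketch in C++ (the same shape applies in C or Fortran). The tile size T = 64 is an arbitrary assumption and should be tuned so that the working blocks fit the target cache level. Depending on your GCC version, the full blocking transformation may also be exposed as -floop-block (check man gcc as above), with the tile size controlled by the loop-block-tile-size parameter mentioned in the manual excerpt.

// Hand-tiled (blocked) matrix multiply: a sketch of what strip mining
// plus loop interchange (blocking) aims to produce automatically.
const int N = 1024;   // problem size (assumption)
const int T = 64;     // tile size (assumption; tune for your cache)

void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                // Work on one T x T block at a time so the A, B and C
                // blocks stay resident in cache across the inner loops.
                for (int i = ii; i < ii + T && i < N; ++i)
                    for (int k = kk; k < kk + T && k < N; ++k)
                        for (int j = jj; j < jj + T && j < N; ++j)
                            C[i][j] += A[i][k] * B[k][j];
}

In the Graphite-based transformation, strip mining creates the ii/jj/kk loops and interchange moves them outermost; the loop-block-tile-size parameter plays the role of T here.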

Related

Why is matmul slower with gfortran compiler optimization turned on?

If I use gfortran (Homebrew GCC 8.2.0) on my Mac to compile the simple program below without optimization (-O0) the call to matmul consistently executes in ~90 milliseconds. If I use any optimization (flags -O1, -O2 or -O3) the execution time increases to ~250 milliseconds. I've tried using a wide range of different sizes for inVect and matrix but in all cases the -O0 option outperforms the other three optimization flags by at least a factor of 2.5. If I use smaller matrices with just a few hundred elements but loop over many calls to matmul the performance hit is even worse, close to a factor of 10.
Is there a way I can avoid this behavior? I need to use optimization in some portions of my code but, at the same time, I also would like to perform the matrix multiplication as efficiently as possible.
I compile the file sandbox.f90 containing the code below with the command gfortran -ON sandbox.f90, where N is an optimization level 0-3 (no other compiler flags are used). The first value of outVect is printed solely to keep the gfortran optimization from being clever and skipping the call to matmul altogether.
I'm a Fortran novice, so I apologize in advance if I am missing something obvious here.
program main
  implicit none
  real :: inVect(20000), matrix(20000,10000), outVect(10000)
  real :: start, finish
  call random_number(inVect)
  call random_number(matrix)
  call cpu_time(start)
  outVect = matmul(inVect, matrix)
  call cpu_time(finish)
  print '("Time = ",f10.7," seconds. – First Value = ",f10.4)',finish-start,outVect(1)
end program main
First, consider that I may be wrong. I just saw this problem for the first time, and I'm as surprised as you.
I just studied this problem and I understand it as follows. The optimization levels -O0, -O3, -Ofast and so on are written for the most general (most frequent) cases. However, in some cases (such as here, where -O3 is less efficient than a lower -O level) an optimization level introduces a drawback, because these levels implicitly enable flags that increase the execution time for the specific task at hand. In your case, -O3 causes, among other things, calls to the matmul() intrinsic to be inlined. That is generally good, but not necessarily for big arrays or for many calls to this function. Somehow, the cost of inlining matmul() outweighs the gain usually obtained from inlining (at least that is how I see it).
To avoid this behavior, I suggest the flags -O3 -finline-matmul-limit=0 (e.g. gfortran -O3 -finline-matmul-limit=0 sandbox.f90), which disable the inlining of the matmul intrinsic. Using -O3 -finline-matmul-limit=0 leads to an execution time that is no worse than what is obtained with -O0.
More generally, -finline-matmul-limit=n inlines matmul only when the involved arrays are smaller than n; I use n=0 for simplicity.
I hope this helps.

Is there a blas implementation using cilkplus array notation?

To my surprise, I'm not able to track down on the web any implementation of BLAS based on cilkplus' array notation. This is strange, because cilkplus should ensure (more than) decent performance on today's multicore workstation CPUs, coupled with a very expressive and compact representation of the BLAS algorithms. It is even more strange considering that BLAS/LAPACK is the de facto standard for dense matrix calculations (at least as a specification).
I understand that there are other, more recent and sophisticated libraries that try to improve/extend blas/lapack, for example I've looked at eigen and flens, but it would still be nice to have a cilkplus version of the "standard" blas implementation.
Is this due to the very limited spread of cilkplus?
http://parallelbook.com/downloads has Cilk Plus code (see "CODE EXAMPLES FROM BOOK") for a few BLAS operations in a Cholesky decomposition example: gemm, potrf, syrk, and trsm. The routines are templates, so they work for any precision.
On the plus side, the Cilk Plus versions give you good composition properties, i.e. you can use them in separate parts of a spawn tree without worry. On the negative side, if you don't need the clean composition, then it's hard to compete with highly tuned parallel BLAS libraries, because the Cilk Plus algorithms tend to be cache oblivious, whereas the highly tuned libraries can exploit cache awareness. E.g., a cache aware algorithm can carefully schedule multiple threads on the same core to work on the same blocks, and thus save memory fetch overhead. It's a lot of work to get the cache awareness right for each machine, but BLAS authors are willing to do the work.
It's exactly the cache awareness ("I own the whole machine" programming) that thwarts clean composition, so you can't have both.
For some BLAS operations, the fork-join structure of Cilk Plus also seems to limit performance compared to less structured parallelism. See slide 2 of http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/cscads-libtune-09/talk17-knobe.pdf for some examples.
Taking gemm as an example, in the end the parallel routine just calls the BLAS (sgemm, dgemm, etc.) routine. This might be the netlib reference, or atlas, or openblas, or mkl, but that is opaque in the suggested citation. I was asking about the existence of a cilkplus implementation of the reference routine, e.g. something like
void dgemm(MATRIX & A, MATRIX & B, MATRIX & C) {
    #pragma cilk grainsize = 64
    cilk_for(int i = 1; i <= A.rows; i++) {
        double *x = &A(i, 1);
        for (int j = 1; j <= A.cols; j++, x += A.colstride)
            ROW(C, i) += (*x) * ROW(B, j);
    }
}
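For what it's worth, a BLAS-like kernel written with Cilk Plus array-section notation (rather than cilk_for alone) might look like the following sketch. This is illustrative only, not a standard BLAS interface; it assumes square row-major matrices and a compiler with Cilk Plus support (e.g. GCC 4.9 through 7 with -fcilkplus, or ICC of that era).

#include <cilk/cilk.h>

// Hypothetical sketch: C += A * B for n x n row-major matrices,
// using an array section of length n to update one row of C at a time.
void dgemm_an(int n, const double *A, const double *B, double *C) {
    cilk_for (int i = 0; i < n; ++i)
        for (int k = 0; k < n; ++k)
            C[i*n : n] += A[i*n + k] * B[k*n : n];
}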

Vector functions of STREAM benchmark

I am currently doing a small research project for school, where I am to test the memory bandwidth performance of a hypervisor compared to the virtualised machines it creates and manages.
Due to the timeframe of the project, only one of the vector functions tested by STREAM will be analysed. My thought process is to look at the results from the "Copy" function, since this is the most basic function and performs no arithmetic, as stated at the bottom of https://www.cs.virginia.edu/stream/ref.html
After all, this is a memory bandwidth performance test.
I have yet to find anything online that proves or disproves my theory. Is there anyone here who can shine some light on this topic?
STREAM Copy and the other three tests are usually written in plain C without explicit vectorization, but the loops are simple and most compilers are able to optimize them into vectorized variants. The kernel line in https://www.cs.virginia.edu/stream/ref.html is the full code of the loop; there are three arrays a, b, c of the same size, preinitialized with some floating-point data, and each element is a double (typically 8 bytes).
The table below shows how many bytes and FLOPs are counted in each iteration of the STREAM loops; a sketch of the Copy kernel follows the table.
The test consists of multiple repetitions of the four kernels, and the best result of (typically) 10 trials is chosen.
------------------------------------------------------------------
name      kernel                    bytes/iter    FLOPs/iter
------------------------------------------------------------------
COPY:     a(i) = b(i)                   16             0
SCALE:    a(i) = q*b(i)                 16             1
SUM:      a(i) = b(i) + c(i)            24             1
TRIAD:    a(i) = b(i) + q*c(i)          24             2
------------------------------------------------------------------
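As a rough illustration of how the Copy kernel and its 16 bytes/iteration translate into a bandwidth figure, here is a minimal C++ sketch. The array size, the single trial, and the POSIX clock_gettime timing are assumptions for illustration; the real STREAM source adds the other kernels, multiple trials, verification, and best-of-N reporting.

#include <cstdio>
#include <ctime>

const long N = 20000000;       // assumed array size; must be much larger than the caches
static double a[N], b[N];

int main() {
    for (long i = 0; i < N; ++i) b[i] = 1.0;

    timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; ++i)   // the Copy kernel: 16 bytes moved, 0 FLOPs per iteration
        a[i] = b[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double gbps = 16.0 * N / secs / 1e9;   // 8 bytes read + 8 bytes written per element
    std::printf("Copy: %.2f GB/s (a[0] = %g)\n", gbps, a[0]);  // print a[0] so the loop is not optimized away
    return 0;
}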
More recent variants of the test are NERSC: http://www.nersc.gov/users/computational-systems/cori/nersc-8-procurement/trinity-nersc-8-rfp/nersc-8-trinity-benchmarks/stream/ and HPCC: http://icl.cs.utk.edu/hpcc/ both based on http://www.cs.virginia.edu/stream/

Fastest way to multiply X*X.transpose() in Eigen?

I want to multiply a matrix with its own transpose. The size of the matrix is about X[8, 100].
Right now it looks like MatrixXf h = X*X.transpose().
a) Is it possible to use a faster multiplication by exploiting these facts:
the result matrix is symmetric;
the X matrix appears on both sides, so a custom multiplication procedure could be used?
b) Also, I could generate the X matrix already transposed and use X.transpose()*X; which should I prefer for my dimensions?
c) Any tips on faster multiplication of such matrices?
Thanks.
(a) Your matrix is too small to take advantage of the symmetry of the result, because if you do so you will lose vectorization. So there is not much you can do.
(b) The default column-major storage should be fine for that example.
(c) Make sure you compile with optimizations on and that SSE2 is enabled (this is the default on 64-bit systems); the devel branch is at least twice as fast for such sizes, and you can get additional speedup by enabling AVX.
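To make (a) and (b) concrete, here is a minimal sketch (assuming Eigen 3.x) comparing the plain product with a rank update that only writes the lower triangle of the symmetric result; for an 8 x 100 X the plain product is expected to be at least as fast, as explained above. Compile with optimizations, e.g. g++ -O2 (add -mavx if available).

#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;

int main() {
    MatrixXf X = MatrixXf::Random(8, 100);

    // Plain product: computes all 64 entries of the 8 x 8 result.
    MatrixXf h1 = X * X.transpose();

    // Symmetric variant: h2 += X * X.transpose(), touching only the
    // lower triangle, then expanded into a full dense matrix.
    MatrixXf h2 = MatrixXf::Zero(8, 8);
    h2.selfadjointView<Lower>().rankUpdate(X);
    MatrixXf h2full = h2.selfadjointView<Lower>();

    std::cout << "difference: " << (h1 - h2full).norm() << std::endl;  // should be ~0
    return 0;
}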

numpy correlation coefficient: np.dot(A, A.T) on large arrays causing seg fault

NOTE:
Speed is not as important as getting a final result.
However, some speed up over worst case is required as well.
I have a large array A:
A.shape=(20000,265) # or possibly larger like 50,000 x 265
I need to compute the correlation coefficients.
np.corrcoef # internally casts the results as doubles
I just borrowed their code and wrote my own cov/corr without casting into doubles, since I really only need 32-bit floats. And I ditched the conj() since my data are always real.
cov = A.dot(A.T)/n  # where A is an array of 32 bit floats
diag = np.diag(cov)
corr = cov / np.sqrt(np.multiply.outer(diag, diag))
I still run out of memory, and I'm using a large-memory machine with 264GB.
I've been told that the fast C libraries probably use a routine which breaks the dot product up into pieces, and that to optimize this the number of elements is padded to a power of 2.
I don't really need to compute the symmetric half of the correlation coefficient matrix. However, I don't see a way to do this in a reasonable amount of time "manually" with Python loops.
Does anybody know of a way to ask numpy for a decent dot product routine that balances memory usage with speed?
Cheers
UPDATE:
Funny how writing these questions tends to help me find the language for a better google query.
Found this:
http://wiki.scipy.org/PerformanceTips
Not sure that I follow it....so, please comment or provide answers about this solution, your own ideas, or just general commentary on this type of problem.
TIA
EDIT: I apologize because my array is really much bigger than I thought.
The array size is actually 151,000 x 265.
I'm running out of memory on a machine with 264 GB, with at least 230 GB free.
I'm surprised that the numpy call to BLAS dgemm, combined with being careful about C-order arrays, didn't do squat.
Python compiled with intel's mkl will run this with 12GB of memory in about 30 seconds:
>>> A = np.random.rand(50000,265).astype(np.float32)
>>> A.dot(A.T)
array([[ 86.54410553, 64.25226593, 67.24698639, ..., 68.5118103 ,
64.57299805, 66.69223785],
...,
[ 66.69223785, 62.01016235, 67.35866547, ..., 66.66306305,
65.75863647, 86.3017807 ]], dtype=float32)
If you do not have access to Intel's MKL, download Python Anaconda and install the accelerate package, which has a 30-day trial version and is free for academics, and which contains an MKL build. Various other C++ BLAS libraries should work as well; even if one copies the array from C to F order it should not take more than ~30GB of memory.
The only thing that I can think of is that your installation is trying to hold the entire 50,000 x 50,000 x 265 array in memory, which is quite frankly terrible. For reference, a float32 50,000 x 50,000 array is only 10GB, while the aforementioned array is 2.6TB...
If it's a gemm issue you can try a chunked gemm routine:
def chunk_gemm(A, B, csize):
    out = np.empty((A.shape[0], B.shape[1]), dtype=A.dtype)
    for i in xrange(0, A.shape[0], csize):
        iend = i + csize
        for j in xrange(0, B.shape[1], csize):
            jend = j + csize
            out[i:iend, j:jend] = np.dot(A[i:iend], B[:, j:jend])
    return out
This will be slower, but will hopefully get over your memory issues.
You can try and see if np.einsum works better than dot for your case:
cov = np.einsum('ij,kj->ik', A, A) / n
The internal workings of dot are a little obscure, as it tries to use BLAS-optimized routines, which sometimes require copies of arrays to be in Fortran order; I am not sure if that's the case here. einsum will buffer its inputs and use vectorized SIMD operations where possible, but outside of that it is basically going to run the naive three nested loops to compute the matrix product.
UPDATE: It turns out the dot product completed without error, but upon careful inspection the output array consists of zeros from column 95,000 to the end of the 151,000 columns.
That is, out[:,94999] is non-zero but out[:,95000] = 0 for all rows...
This is super annoying...
Another BLAS description
That exchange mentions something that I thought about too... Since BLAS is Fortran, shouldn't the order of the input be F order? Whereas the scipy doc page below says C order.
Trying F order caused a segmentation fault, so I'm back to square one.
ORIGINAL POST
I finally tracked down my problem, which was in the details as usual.
I'm using arrays of np.float32 which were stored in F order. I can't control the F order, to my knowledge, since the data is loaded from images using an imaging library.
import numpy as np
import scipy.linalg.blas
roi = np.ascontiguousarray(roi)  # see roi.flags below
out = scipy.linalg.blas.sgemm(alpha=1.0, a=roi, b=roi, trans_b=True)
This level 3 BLAS routine does the trick. My problem was twofold:
roi.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
And... I was using BLAS dgemm, NOT sgemm. The 'd' is for 'double' and the 's' is for 'single' precision.
See this pdf: BLAS summary pdf
I looked at it once and was overwhelmed...I went back and read the wikipedia article on blas routines to understand level 3 vs other levels: wikipedia article on blas
Now it works on A = 150,000 x 265, performing A.dot(A.T).
Thanks everyone for your thoughts...knowing that it could be done was most important.