Is there a blas implementation using cilkplus array notation? - blas

To my surprise, I'm not able to track on the web any implementation of BLAS based on cilkplus' array notation. It is strange, because cilkplus should ensure a (more than) decent performance on today's multicore workstation CPUs, coupled to a very expressive and compact representation of the BLAS algorithms. Even more strange, considering that BLAS/LAPACK is the de facto standard for dense matrix calculations (at least, as specification).
I understand that there are other more recent and sofisticate libraries that try to improve/extend the blas/lapack, for example I've looked at eigen and flens, but still it would be nice to have a cilkplus version of the "standard" blas implementation.
Is this depending by a very limited spread of cilkplus?

http://parallelbook.com/downloads has Cilk Plus code (see "CODE EXAMPLES FROM BOOK") for a few BLAS operations in a Cholesky decomposition example: gemm, portrf, syrk, and trsm. The routines are templates, so they work for any precision.
On the plus side, the Cilk Plus versions give you good composition properties, i.e. you can use them in separate parts of a spawn tree without worry. On the negative side, if you don't need the clean composition, then it's hard to compete with highly tuned parallel BLAS libraries, because the Cilk Plus algorithms tend to be cache oblivious, whereas the highly tuned libraries can exploit cache awareness. E.g., a cache aware algorithm can carefully schedule multiple threads on the same core to work on the same blocks, and thus save memory fetch overhead. It's a lot of work to get the cache awareness right for each machine, but BLAS authors are willing to do the work.
It's exactly the cache awareness ("I own the whole machine" programming) that thwarts clean composition, so you can't have both.
For some BLAS operations, the fork-join structure of Cilk Plus also seems to limit performance compared to less structured parallelism. See slide 2 of http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/cscads-libtune-09/talk17-knobe.pdf for some examples.

Taking gemm as example, at the end the parallel routine is just calling the blas (sgemm, dgemm, etc.) routine. This might be the netlib reference, or atlas, or openblas, or mkl, but this is opaque in the suggested citation. I was asking for the existence of cilkplus implementation of the reference routine, e.g. something like
void dgemm(MATRIX & A, MATRIX & B, MATRIX & C) {
#pragma cilk grainsize = 64
cilk_for(int i = 1; i <= A.rows; i++) {
double *x = &A(i, 1);
for (int j = 1; j <= A.cols; j++, x += A.colstride)
ROW(C, i) += (*x) * ROW(B, j);
}
}

Related

Causes of floating point non-determinism? Including NumPy?

IEEE floating point operations are deterministic, but see How can floating point calculations be made deterministic? for one way that an overall floating point computation can be non-deterministic:
... parallel computations are non-deterministic in terms of the order in which floating-point computations are performed, which can result in non-bit-exact results across runs.
Two-part question:
How else can an overall floating point computation be non-deterministic, yielding results that are not exactly equal?
Consider a single-threaded Python program that calls NumPy, CVXOPT, and SciPy subroutines such as scipy.optimize.fsolve(), which in turn call native libraries like MINPACK and GLPK and optimized linear algebra subroutines like BLAS, ATLAS, and MKL. “If your numpy/scipy is compiled using one of these, then dot() will be computed in parallel (if this is faster) without you doing anything.”
Do these native libraries ever parallelize in a way that introduces non-deterministic results?
Assumptions:
The same software, with the same inputs, on the same hardware. The output of multiple runs should be equal.
If that works, it's highly desirable to test that the output after doing a code refactoring is equal. (Yes, some changes in order of operations can make some of the output not-equal.)
All random numbers in the program are psuedo-random numbers used in a consistent way from the same seeds across all runs.
No uninitialized values. Python is generally safe in that way but numpy.empty() returns a new array without initializing entries. And it's not clear that it's much faster in practice. So beware!
#PaulPanzer's test shows that numpy.empty() does return an uninitialized array and it can easily and quickly recycle a recent array:
import numpy as np
np.arange(100); np.empty(100, int); np.empty(100, int)
np.arange(100, 200.0); np.empty(100, float); np.empty(100, float)
It's tricky to get useful timing measurements for these routines! In a timeit loop, numpy.empty() can just keep reallocating the same one or two memory nodes. The time is independent of the array size. To prevent recycling:
from timeit import timeit
timeit('l.append(numpy.empty(100000))', 'import numpy; l = []')
timeit('l.append(numpy.zeros(100000))', 'import numpy; l = []')
but reducing that array size to numpy.zeros(10000) takes 15x as long; reducing it to numpy.zeros(1000) takes 1.3x as long (on my MBP). Puzzling.
See also:
Hash values are salted in Python 3 and each dict preserves insertion order. That could vary the order of operations from run to run. [I'm wrangling with this problem in Python 2.7.15.]
I found that most (not all) of the non-determinism problems I'm experiencing seem to be fixed in the code for OpenBLAS 0.3.5.
A bunch of threading problems in earlier versions of OpenBLAS are fixed in release 0.3.4, but that release has a macOS compatibility bug that's fixed in the code for release 0.3.5. The bugs also occurs with Apple's Accelerate framework version 1.1 and Intel's MKL mkl==2019.0.
See how to install OpenBLAS and compile NumPy and SciPy on it.
Perhaps the remaining problems I'm experiencing are due to other libraries linked to Accelerate?
Note: I'm still open to more answers to this question.

Explaining the different types in Metal and SIMD

When working with Metal, I find there's a bewildering number of types and it's not always clear to me which type I should be using in which context.
In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are supported within a Metal shader file. However, there's plenty of sample code available that seems to use additional types that are part of SIMD. On the macOS (Objective-C) side of things, the Metal types are not available but the SIMD ones are and I'm not sure which ones I'm supposed to be used.
For example:
In the Metal Spec, there's float2 that is described as a "vector" data type representing two floating components.
On the app side, the following all seem to be used or represented in some capacity:
float2, which is typedef ::simd_float2 float2 in vector_types.h
Noted: "In C or Objective-C, this type is available as simd_float2."
vector_float2, which is typedef simd_float2 vector_float2
Noted: "This type is deprecated; you should use simd_float2 or simd::float2 instead"
simd_float2, which is typedef __attribute__((__ext_vector_type__(2))) float simd_float2
::simd_float2 and simd::float2 ?
A similar situation exists for matrix types:
matrix_float4x4, simd_float4x4, ::simd_float4x4 and float4x4,
Could someone please shed some light on why there are so many typedefs with seemingly overlapping functionality? If you were writing a new application today (2018) in Objective-C / Objective-C++, which type should you use to represent two floating values (x/y) and which type for matrix transforms that can be shared between app code and Metal?
The types with vector_ and matrix_ prefixes have been deprecated in favor of those with the simd_ prefix, so the general guidance (using float4 as an example) would be:
In C code, use the simd_float4 type. (You have to include the prefix unless you provide your own typedef, since C doesn't have namespaces.)
Same for Objective-C.
In C++ code, use the simd::float4 type, which you can shorten to float4 by using namespace simd;.
Same for Objective-C++.
In Metal code, use the float4 type, since float4 is a fundamental type in the Metal Shading Language [1].
In Swift code, use the float4 type, since the simd_ types are typealiased to shorter names.
Update: In Swift 5, float4 and related types have been deprecated in favor of SIMD4<Float> and related types.
These types are all fundamentally equivalent, and all have the same size and alignment characteristics so you can use them across languages. That is, in fact, one of the design goals of the simd framework.
I'll leave a discussion of packed types to another day, since you didn't ask.
[1] Metal is an unusual case since it defines float4 in the global namespace, then imports it into the metal namespace, which is also exported as the simd namespace. It additionally aliases float4 as vector_float4. So, you can use any of the above names for this vector type (except simd_float4). Prefer float4.
which type should you use to represent two floating values (x/y)
If you can avoid it, don't use a single SIMD vector to represent a single geometry x,y vector if you're using CPU SIMD.
CPU SIMD works best when you have many of the same thing in each SIMD vector, because they're actually stores in 16-byte or 32-byte vector registers where "vertical" operations between two vectors are cheap (packed add or multiply), but "horizontal" operations can mostly only be done with a shuffle + a vertical operation.
For example a vector of 4 x values and another vector of 4 y values lets you do 4 dot-products or 4 cross-products in parallel with no shuffling, so the overall throughput is significantly more dot-products per clock cycle than if you had a vector of [x1, y1, x2, y2].
See https://stackoverflow.com/tags/sse/info, and especially these slides: SIMD at Insomniac Games (GDC 2015) for more about planning your data layout and program design for doing many similar operations in parallel instead of trying to accelerate single operations.
The one exception to this rule is if you're only adding / subtracting to translate coordinates, because that's still purely a vertical operation even with an array-of-structs. And thus fine for CPU short-vector SIMD based on 16-byte vectors. (e.g. the 2nd element in one vector only interacts with the 2nd element in another vector, so no shuffling is needed.)
GPU SIMD is different, and I think has no problem with interleaved data. I'm not a GPU expert.
(I don't use Objective C or Metal, so I can't help you with the details of their type names, just what the underlying CPU hardware is good at. That's basically the same for x86 SSE/AVX, ARM NEON / AArch64 SIMD, or PowerPC Altivec. Horizontal operations are slower.)

OpenCL 2.x - Sum Reduction function

From this previous post: strategy-for-doing-final-reduction, I would like to know the last functionalities offered by OpenCL 2.x (not 1.x which is the subject of this previous post above), especially about the atomic functions which allow to perform reductions of a array (in my case a sum reduction).
One told me that performances of OpenCL 1.x atomic functions (atom_add) were bad and I could check it, so I am looking for a way to get the best performances for a final reduction function (i.e the sum of each computed sum corresponding to each work-group).
I recall the typical kind of kernel code that I am using for the moment :
__kernel void sumGPU ( __global const double *input,
__global double *partialSums,
__local double *localSums)
{
uint local_id = get_local_id(0);
uint group_size = get_local_size(0);
// Copy from global memory to local memory
localSums[local_id] = input[get_global_id(0)];
// Loop for computing localSums
for (uint stride = group_size/2; stride>0; stride /=2)
{
// Waiting for each 2x2 addition into given workgroup
barrier(CLK_LOCAL_MEM_FENCE);
// Divide WorkGroup into 2 parts and add elements 2 by 2
// between local_id and local_id + stride
if (local_id < stride)
localSums[local_id] += localSums[local_id + stride];
}
// Write result into partialSums[nWorkGroups]
if (local_id == 0)
partialSums[get_group_id(0)] = localSums[0];
}
As you can see, at the end of kernel code execution, I get the array partialSums[number_of_workgroups] containing all partial sums.
Could you tell me please how to perform a second and final reduction of this array, with the best performances possibles of functions availables with OpenCL 2.x . A classic solution is to perform this final reduction with CPU but ideally, I would like to do it directly with kernel code.
A suggestion of code snippet is welcome.
A last point, I am working on MacOS High Sierra 10.13.5 with the following model :
Can OpenCL 2.x be installed on my hardware MacOS model ?
Atomic functions should be avoided because they do harm performance compared to a parallel reduction kernel. Your kernel looks to be on the right track, but you need to remember that you'll have to invoke it multiple times; do not perform the final sum on the host (unless you have a very small amount of data from the previous reduction). That is, you need to keep invoking it until your local size equals your global size. There's no way to do a single invocation for large amounts of data as there is no way to synchronize between work groups.
Additionally, you want to be careful to set an appropriate work group size (i.e. local size), which depends on local & global memory throughput & latency. Unfortunately, as far as I'm aware there is no way to determine this through OpenCL, outside of self-profiling code, though that's not too difficult to write as OCL provides you with JIT compilation. Through empirical testing I've found you should find a sweet spot between suffering too many bank conflicts (too large a local size) vs. global memory latency penalties (too small a local size). It's best to do a benchmark first to determine optimal local size for your reduction, and then use that local size for future reductions.
Edit: It's also worth noting that the best way to chain your kernel invocation together is through OpenCL events.

OpenCL: Type conversion overhead

What is the cost of casting a variable to a different type in OpenCL?
Example: I want to take dot product of 2 int3 vectors (AFAIK dot() isn't overloaded for int3s), so instead of implementing dot() by myself in unvectorized way, I want to vectorize the code by using the native dot() for float3. First I convert the 2 vectors to float3s and then I cast the result to int.
Which of the two functions, foo and bar, is less time consuming (and why)?
inline int foo(int3 a, int3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
inline int bar(int3 a, int3 b) {
return (int)dot(convert_float3(a), convert_float3(b));
}
As has been suggested in the comments, measuring is going to be the most useful tool in practice, and the cost of individual instructions is heavily dependent on hardware architecture, but also the compiler.
Nevertheless, a comparison to other operations is useful, and at least AMD publishes a list of the instruction throughput for their devices in this section of their OpenCL optimisation guide, and this includes float-to-int and int-to-float conversion.
In your particular case, I strongly suspect your "vectorising" attempts will have detrimental effects. Most modern GPUs aren't SIMD processors in the CPU SIMD sense. The threads run in lock-step, but each thread operates on scalars. A "horizontal" operation like a dot product may not be particularly efficient even if the GPU does use per-thread SIMD.
If you can limit the range of each of your integers to 24 bits, a series of mad24() and mul24() calls will most likely be fastest. But again - measure. Try the different options on a range of hardware, and run them lots of times, applying basic stats to make sure you aren't just seeing random variation/overhead.
A separate thing to note with regard to integer-to-float conversions is that such conversions are often "free" when you sample as floats from an image object containing integers.

Shader programming - test if between two values

Shaders allow if(x>a && x<b) but my understanding is logical tests are slow. Is this worth optimizing along the lines:
mid = (a+b)*.5;
half = (b-a)*.5;
if(abs(x - mid) < half){...}
It seems a lot more code but from the days of CPU ASM, such tricks were sometimes worth doing.
If half and mid can be calculated once on the CPU and passed in as shader parameters, then we replace: if(x>a && x<b) with: if(abs(x - mid) < half) - is this worth bothering with?
In my tests I see a little improvement pulling half and mid into pre-calculated shader params, but I've only one GPU to test on so I'm after a "in general" answer, and if there are cases it will be dramatically better or worse.
The precalculating part of the shader params could be irrelevant (especially on directx), as the compiler is often smart enough to figure out the expression is constant.
Question is how abs is implemented. What's your target GPU? Mobile,mid,highend?