Get BLAS implementation information from CMake

In a project I am using
find_package(BLAS REQUIRED)
to detect BLAS.
Is there a way to tell which implementation of BLAS was found after this?
According to the documentation, BLA_VENDOR can be used to request a specific implementation, but nothing reports which one was actually found.
Unfortunately I need to know which BLAS was found, because different implementations have subtle differences in their interfaces; for example, MKL's zdotu takes 6 arguments rather than 5 (the first being a pointer to the result value).

Per @Tsyvarev's suggestion, I ended up doing this:
# First try to find MKL specifically.
set(BLA_VENDOR Intel10_64lp)
find_package(BLAS)
if(BLAS_FOUND)
    message("MKL environment detected")
    add_definitions(-DRETURN_BY_STACK)
else()
    # Otherwise fall back to whatever BLAS is available.
    unset(BLA_VENDOR)
    find_package(BLAS REQUIRED)
endif()
It seems that -DRETURN_BY_STACK (or -DFORTRAN_COMPLEX_FUNCTIONS_RETURN_VOID) is a way recognized by some BLAS headers (e.g. cblas.h) to have MKL-compatible declarations.
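To illustrate the interface difference from the question, here is a minimal sketch (my own illustration, not the actual contents of any header; the exact prototypes and the integer type depend on the BLAS distribution and on the LP64/ILP64 model):
#include <complex>

extern "C" {
#ifdef RETURN_BY_STACK
// MKL-style convention: the complex result is written through an extra
// first pointer argument (6 arguments in total).
void zdotu_(std::complex<double>* result, const int* n,
            const std::complex<double>* zx, const int* incx,
            const std::complex<double>* zy, const int* incy);
#else
// Reference BLAS convention: the complex result is returned by value
// (5 arguments).
std::complex<double> zdotu_(const int* n,
                            const std::complex<double>* zx, const int* incx,
                            const std::complex<double>* zy, const int* incy);
#endif
}

// Small wrapper so the rest of the code does not care which convention is in use.
inline std::complex<double> dotu(int n, const std::complex<double>* x, int incx,
                                 const std::complex<double>* y, int incy)
{
#ifdef RETURN_BY_STACK
    std::complex<double> result;
    zdotu_(&result, &n, x, &incx, y, &incy);
    return result;
#else
    return zdotu_(&n, x, &incx, y, &incy);
#endif
}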
Technically one might need to try every variant of Intel MKL BLAS and choose one somehow; I am listing the vendor strings here for completeness:
Intel10_32 (Intel MKL v10, 32-bit)
Intel10_64lp (Intel MKL v10+, 64-bit, threaded code, LP64 model)
Intel10_64lp_seq (Intel MKL v10+, 64-bit, sequential code, LP64 model)
Intel10_64ilp (Intel MKL v10+, 64-bit, threaded code, ILP64 model)
Intel10_64ilp_seq (Intel MKL v10+, 64-bit, sequential code, ILP64 model)
Intel (obsolete versions of Intel MKL, 32- and 64-bit)

Related

Is it safer to use OpenCL rather than SYCL when the objective is to have the most hardware-compatible program?

My objective is to be able to parallelize a code so that it can run on a GPU, and the holy grail would be software that can run in parallel on any GPU, or even on CPUs (Intel, NVIDIA, AMD, and so on).
From what I understood, the best solution would be to use OpenCL. But shortly after that, I also read about SYCL, which is supposed to simplify code that runs on GPUs.
But is it just that? Isn't it better to use a lower-level language in order to be sure it can be used on as much hardware as possible?
I know that all the compatibilities are listed on The Khronos Group website, but I am reading everything and its opposite on the Internet (like: if an NVIDIA card supports CUDA, then it supports OpenCL; or: NVIDIA cards will never work with OpenCL, even though OpenCL is supposed to work with everything)...
This is a new topic to me and there is a lot of information on the Internet... It would be great if someone could give me a simple answer to this question.
Probably yes.
OpenCL is supported on all AMD/Nvidia/Intel GPUs and on all Intel CPUs since around 2009. For best compatibility with almost any device, use OpenCL 1.2. The nice thing is that the OpenCL Runtime is included in the graphics drivers, so you don't have to install anything extra to work with it or to get it working on another machine.
SYCL on the other hand is newer and not yet established that well. For example, it is not officially supported (yet) on Nvidia GPUs: https://forums.developer.nvidia.com/t/is-sycl-available-on-cuda/37717/7
But there are already SYCL implementations that are compatible with Nvidia/AMD GPUs, essentially built on top of CUDA or OpenCL 1.2; see here: https://stackoverflow.com/a/63468372/9178992
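As a quick way to see what the installed drivers actually expose on a given machine, you can enumerate OpenCL platforms and devices with the plain C API; a minimal sketch (link against the OpenCL library, e.g. -lOpenCL):
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_uint num_platforms = 0;
    clGetPlatformIDs(0, nullptr, &num_platforms);
    if (num_platforms > 16) num_platforms = 16;
    cl_platform_id platforms[16];
    clGetPlatformIDs(num_platforms, platforms, nullptr);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);
        std::printf("Platform: %s\n", pname);

        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 0, nullptr, &num_devices);
        if (num_devices > 16) num_devices = 16;
        cl_device_id devices[16];
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, num_devices, devices, nullptr);

        for (cl_uint d = 0; d < num_devices; ++d) {
            char dname[256], dversion[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
            clGetDeviceInfo(devices[d], CL_DEVICE_VERSION, sizeof(dversion), dversion, nullptr);
            std::printf("  Device: %s (%s)\n", dname, dversion);
        }
    }
    return 0;
}
The CL_DEVICE_VERSION string tells you which OpenCL version each device supports, which is how you would confirm the OpenCL 1.2 baseline mentioned above.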

Disabling AVX2 in CPU for testing purposes

I've got an application that requires AVX2 to work correctly. A check was implemented during application start-up to verify that the CPU supports AVX2. I would like to check that this works correctly, but I only have a CPU that has AVX2. Is there a way to temporarily turn it off for testing purposes? Or to somehow emulate another CPU?
Yes, use an "emulation" (or dynamic recompilation) layer like Intel's Software Development Emulator (SDE), or maybe QEMU.
SDE is closed-source freeware, and very handy both for testing AVX-512 code on old CPUs and for simulating old CPUs to check that you don't accidentally execute instructions that are too new.
Example: I happened to have a binary that unconditionally uses an AVX2 vpmovzxwq load instruction (for a function I was testing). It runs fine on my Skylake CPU natively, but SDE has a -snb option to emulate a Sandybridge in both CPUID and actually checking every instruction.
$ sde64 -snb -- ./mask
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (SANDYBRIDGE): 0x401005: vpmovzxwq ymm2, qword ptr [rip+0xff2]
Image: /tmp/mask+0x5 (in multi-region image, region# 1)
Instruction bytes are: c4 e2 7d 34 15 f2 0f 00 00
There are options to emulate CPUs as old as -quark, -p4 (SSE2), or Core 2 Merom (-mrm), to as new as IceLake-Server (-icx) or Tremont (-tnt). (And Xeon Phi CPUs like KNL and KNM.)
It runs pretty quickly, using dynamic recompilation (JIT) so code using only instructions that are supported natively can run at basically native speed, I think.
It also has instrumentation options (like -mix to dump the instruction mix), and options to control the JIT more closely. I think you could maybe get it to not report AVX2 in CPUID, but still let AVX2 instructions run without faulting.
Or probably emulate a CPU that supports AVX2 but not FMA (there is a real CPU like this from Via, unfortunately). Or combinations that no real CPU has, like AVX2 but not popcnt, or BMI1/BMI2 but not AVX. But I haven't looked into how to do that.
The basic sde -help options only let you set it to specific Intel CPUs, and for checking for potentially-slow SSE/AVX transitions (without correct vzeroupper usage). And a few other things.
One important test-case that SDE is missing is AVX+FMA without AVX2 (AMD Piledriver / Steamroller, i.e. most AMD FX-series CPUs). It's easy to forget and use an AVX2 shuffle in code that's supposed to be AVX1+FMA3, and some compilers (like MSVC) won't catch this at compile time the way gcc -march=bdver2 would. (Bulldozer only has AVX + FMA4, not FMA3, because Intel changed their plans after it was too late for AMD to redesign.)
If you just want CPUID to not report the presence of AVX2 (and FMA?) so your code falls back to its AVX1 or non-AVX versions of functions, you can do that with most VMs.
For AVX instructions to run without faulting, a bit in a control register has to be set. (So this works like a promise by the OS that it will correctly save/restore the new architectural state of YMM upper halves). So disabling AVX in CPUID will give you a VM instance where AVX instructions fault. (At least 256-bit instructions? I haven't tried this to see if 128-bit AVX instructions can still execute in this state on HW that supports AVX.)
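For completeness, the kind of start-up check and dispatch this testing targets might look like the sketch below (GCC/Clang builtins; MSVC would use __cpuidex instead; this is an illustration, not the asker's actual check). Masking CPUID in SDE or a VM changes what __builtin_cpu_supports reports, while actually executing an AVX2 instruction on the wrong path is what SDE's per-instruction checking would catch:
#include <cstdio>

// Scalar fallback used when AVX2 is not reported by CPUID.
static void process_scalar(float* data, int n) {
    for (int i = 0; i < n; ++i) data[i] *= 2.0f;
}

int main() {
    float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    __builtin_cpu_init();  // harmless if CPU detection already ran
    if (__builtin_cpu_supports("avx2")) {
        std::puts("CPUID reports AVX2: would dispatch to the AVX2 code path");
    } else {
        std::puts("CPUID does not report AVX2: using the scalar fallback");
        process_scalar(data, 8);
    }
    std::printf("data[0] = %.1f\n", data[0]);
    return 0;
}
Running this under sde64 -snb -- ./a.out should take the fallback branch even on an AVX2-capable host, since -snb masks CPUID down to Sandy Bridge.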

CUDA Expression Templates and Just in Time Compilation (JIT)

I have some questions about Just-In-Time (JIT) compilation with CUDA.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition
It seems to work fairly well. If I compare the computing time of the matrix element-wise operation
D_D=A_D*B_D-sin(C_D)+3.;
with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):
time [ms] hand-written kernel: 2.05 (1024x1024), 8.16 (2048x2048), 57.4 (4096x4096)
time [ms] LIBRARY: 2.07 (1024x1024), 8.17 (2048x2048), 57.4 (4096x4096)
The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, according to Expression templates: improving performance in evaluating expressions?. My first question is:
1. What kind of further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional burden due to runtime compilation?
It is known that a library based on Expression Templates cannot be put inside a .dll library, see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation to a third-party user? If yes, how?
The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.
EDIT following Talonmies' comment
From the Cuda kernel just-in-time (jit) compilation possible? post, one reads that
cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime
A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, CUDA/C++ template code cannot be compiled to PTX. But perhaps if I instantiate all the possible combinations of Types/Operators for Unary and Binary Expressions, at least a part of the implementation could be compiled to PTX (and thus hidden from third-party users), which could in turn be JIT-compiled to the architecture at hand.
I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).
I take a similar, expression-template based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (which is a dialect of C).
VexCL started as an expression template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA sources: the nvcc compiler is called behind the scenes, the compiled PTX is stored in an offline cache, and it is loaded on subsequent launches of the program. See the CUDA backend sources for how to do this; compiler.hpp should probably be of most interest for you.
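Regarding question 3, a minimal sketch of what such runtime compilation can look like with NVRTC (available since CUDA 7) plus the driver API is below; error handling is omitted and the kernel string stands in for source generated from an expression tree. With the nvcc-behind-the-scenes approach you would instead write the source to a file, run nvcc --ptx, and load the resulting PTX with the same cuModuleLoadDataEx call. Build with something like g++ jit.cpp -lnvrtc -lcuda.
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

int main() {
    // Kernel source assembled at runtime (e.g. generated from an expression tree).
    const char* source =
        "extern \"C\" __global__ void axpy(float a, const float* x, float* y, int n) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) y[i] += a * x[i];\n"
        "}\n";

    // 1. Compile CUDA C++ to PTX with NVRTC.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source, "axpy.cu", 0, nullptr, nullptr);
    const char* opts[] = {"--gpu-architecture=compute_50"};
    nvrtcCompileProgram(prog, 1, opts);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX at runtime with the driver API (this is the JIT step).
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadDataEx(&mod, ptx.c_str(), 0, nullptr, nullptr);
    CUfunction fn; cuModuleGetFunction(&fn, mod, "axpy");

    // 3. Launch as usual.
    const int n = 1024;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);
    CUdeviceptr dx, dy;
    cuMemAlloc(&dx, n * sizeof(float));
    cuMemAlloc(&dy, n * sizeof(float));
    cuMemcpyHtoD(dx, hx.data(), n * sizeof(float));
    cuMemcpyHtoD(dy, hy.data(), n * sizeof(float));
    float a = 3.0f; int nn = n;
    void* args[] = {&a, &dx, &dy, &nn};
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();
    cuMemcpyDtoH(hy.data(), dy, n * sizeof(float));
    std::printf("y[0] = %f\n", hy[0]);  // expect 5.0
    return 0;
}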

How to recompile Numpy with enabled OpenMP directives

In this answer to Multiprocessing.Pool makes Numpy matrix multiplication slower, the author recommends (in the second paragraph) recompiling Numpy with OpenMP directives enabled.
So my questions are:
How do you do that?
What could be negative side effects?
Would you recommend that?
Searching SO I found the following post, OpenMP and Python, where the answers explain why there is no use for OpenMP in general Python code due to the GIL. But I assume Numpy is a different issue.
While Python code itself hardly benefits from running in parallel, NumPy is not written in Python. It is in fact a Pythonic wrapper around some very well-established numerical libraries and other numerical algorithms, implemented in compiled languages like Fortran and C. Some of these libraries already come in parallel multithreaded versions (like Intel MKL and ATLAS, when used to provide the BLAS and LAPACK implementations in NumPy).
The idea is that in NumPy programs the Python code should only be used to drive the computations, while all the heavy lifting should be done in the C or Fortran backends. If your program doesn't spend most of its run time inside NumPy routines, then Amdahl's law will prevent you from getting a reasonable speed-up with parallel NumPy.
In order to get NumPy to support OpenMP, you must have an OpenMP-enabled C compiler. Most C compilers nowadays support OpenMP and this includes GCC, Intel C Compiler, Oracle C Compiler, and even Microsoft Visual C Compiler (although it is stuck at an ancient OpenMP version). Read the Installation Manual for detailed instructions.
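To make "OpenMP directives" concrete: in the C backends they are #pragma annotations like the one in this stand-alone sketch (an illustration, not NumPy's actual source). The pragma is only honored when the compiler's OpenMP flag is given (e.g. g++ -fopenmp), and the thread count can then be controlled with OMP_NUM_THREADS:
#include <cstdio>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0), c(n);

    // With -fopenmp the iterations are split across threads;
    // without it the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i];

#ifdef _OPENMP
    std::printf("compiled with OpenMP, max threads = %d\n", omp_get_max_threads());
#else
    std::printf("compiled without OpenMP\n");
#endif
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}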

GPU Processing in vb.net

I've got a program that takes about 24 hours to run. It's all written in VB.net and it's about 2000 lines long. It's already multi-threaded and this works perfectly (after some sweat and tears). I typically run the process with 10 threads, but I'd like to increase that to reduce processing time, which is where using the GPU comes into it. I've searched Google for everything related that I can think of to find some info, but no luck.
What I'm hoping for is a basic example of a vb.net project that does some general operations then sends some threads to the GPU for processing. Ideally I don't want to have to pay for it. So something like:
'Do some initial processing, e.g.
Dim x As Integer
Dim y As Integer
Dim z As Integer
x = CInt(TextBox1.Text)
y = CInt(TextBox2.Text)
z = x * y
'Do some multi-threaded operations on the GPU, e.g.
'Show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in c++ and other languages but I'm rubbish at understanding other languages!
Thanks all!
Fraser
The VB.NET compiler does not compile onto the GPU, it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code. In turn this is compiled with NVCC to generate code that the GPU can execute. Note that this means that you are limited to NVidia GPUs as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version. The actual .NET-to-CUDA conversion is handled by the CUDAfy Translator:
Using SharpDevelop's decompiler ILSpy as a basis, the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others, but these seem to be the main ones. If you decide to use CUDAfy then you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to fit the GPU data-parallel model, unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead here largely depends not only on the size but also on the type of data, and on whether it can be marshaled without conversion. This is in addition to the cost of moving data from the CPU to the GPU, etc.