cupy jit kernel vs numba cuda - cupy

(I'm sharing the attachment as a Google Drive link at "https://drive.google.com/file/d/1yzXfu5ZdY5ByxfYwTQLrQWgzTZrsww8u/view?usp=sharing" because it is large.)
I'm comparing the results of the attached CuPy implementation, "test_cupy_jit_kernel.py" (using jit.rawkernel), with those of a Numba CUDA implementation, "test_numba_cuda.py" (using a cuda.jit kernel). The functions in both files are almost the same (except for using cupy.absolute instead of the plain Python abs function). There is a huge difference in the results, as shown in the attached result files ("init_paw_cupy_jit.npy" and "init_paw_numba.npy"), and I don't know why.
So I tried a simple CuPy implementation (without kernels) in the attached "test_cupy_simple.py" file. Its results (the attached "init_paw_numba.npy") are very close to the Numba code, but still different from the CuPy JIT kernel implementation. Although this simple CuPy implementation works, it takes longer than both the Numba and the CuPy JIT kernel versions.
I don't know how to fix the difference in results.
BTW, I also tried a fused kernel, but I received different and strange errors, so I gave up on that approach.
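To make the comparison concrete, here is a stripped-down sketch of the kind of kernel pair I am comparing (not the attached files; the array names, sizes and launch configuration are just placeholders, and I spell out abs by hand in the JIT kernel so the test doesn't depend on builtin abs support):

import cupy as cp
from cupyx import jit
from numba import cuda

# CuPy JIT raw kernel: element-wise absolute value, written out by hand.
@jit.rawkernel()
def abs_cupy(x, out, size):
    i = jit.blockIdx.x * jit.blockDim.x + jit.threadIdx.x
    if i < size:
        v = x[i]
        if v < 0:
            v = -v
        out[i] = v

# Numba CUDA kernel doing the same element-wise operation.
@cuda.jit
def abs_numba(x, out, size):
    i = cuda.grid(1)
    if i < size:
        out[i] = abs(x[i])

n = 1 << 20
x = cp.random.standard_normal(n, dtype=cp.float64)
out_cupy = cp.empty_like(x)
out_numba = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
abs_cupy((blocks,), (threads,), (x, out_cupy, cp.uint32(n)))
# Numba accepts CuPy arrays directly through the CUDA array interface.
abs_numba[blocks, threads](x, out_numba, n)

# If even this trivial pair disagrees, the problem is in the kernel or launch
# configuration; if it agrees, the divergence comes from something else in the
# real files (dtype, math functions, accumulation order, ...).
print(cp.max(cp.abs(out_cupy - out_numba)))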
Regards

Related

MATLAB calling cuda GPU code without kernel functions

As the title indicates: I have C MEX code that I can successfully call from MATLAB. Now I simply change the file suffix from *.cpp to *.cu and compile with the command
mexcuda updateElectricFields.cu
Compilation and execution succeed and yield the correct results. My questions are: is this *.cu file actually executed on the GPU? And is the GPU accessing CPU memory, which would make it slower than running on the CPU?

TensorFlow operations with GPU support

Is there a way (or maybe a list?) to determine all the TensorFlow operations that offer GPU support?
Right now it is a trial-and-error process for me: I try to place a group of operations on the GPU; if it works, sweet; if not, I try somewhere else.
This is the only thing relevant (but not helpful) I have found so far: https://github.com/tensorflow/tensorflow/issues/2502
Unfortunately, it seems there's no built-in way to get an exact list or even check this for a specific op. As mentioned in the issue above, the dispatching is done in the native C++ code: a particular operation can be assigned to GPU if a corresponding kernel has been registered to DEVICE_GPU.
I think the easiest way is to run grep -r "REGISTER_KERNEL_BUILDER" over the TensorFlow source tree to get a list of the matched kernel registrations.
But remember that even with a REGISTER_KERNEL_BUILDER registration, there's no guarantee an op will actually be performed on a GPU. For example, 32-bit integer Add is assigned to the CPU regardless of the existing GPU kernel.
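If you just want to probe a specific op empirically, one option (a rough sketch, using the TF 1.x session API that was current at the time of this question) is to pin the op to the GPU with soft placement disabled and turn on device placement logging; if there is no usable GPU kernel for that op/dtype combination, the run fails instead of silently falling back to the CPU:

import tensorflow as tf  # TF 1.x style API

# Pin the ops to the GPU explicitly.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0])
    b = tf.constant([4.0, 5.0, 6.0])
    c = a + b  # swap in the op/dtype combination you actually care about

config = tf.ConfigProto(log_device_placement=True,   # print where each op runs
                        allow_soft_placement=False)  # error out instead of falling back to CPU
with tf.Session(config=config) as sess:
    print(sess.run(c))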

CUDA Expression Templates and Just in Time Compilation (JIT)

I have some questions about Just-In-Time (JIT) compilation with CUDA.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition
It seems to work fairly well. If I compare the computing time of the matrix elementwise operation
D_D=A_D*B_D-sin(C_D)+3.;
with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):
time [ms] hand-written kernel: 2.05 (1024x1024), 8.16 (2048x2048), 57.4 (4096x4096)
time [ms] library:             2.07 (1024x1024), 8.17 (2048x2048), 57.4 (4096x4096)
The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, following Expression templates: improving performance in evaluating expressions?. My first question is:
1. Which kind of further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?
It is known that a library based on Expression Templates cannot be put inside a .dll library; see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation to a third-party user? If yes, how?
The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.
EDIT following Talonmies' comment
The Cuda kernel just-in-time (jit) compilation possible? post states that
cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime
A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, CUDA/C++ template code cannot be compiled to PTX. But perhaps, if I instantiate all the possible combinations of types and operators for unary and binary expressions, at least part of the implementation could be compiled to PTX (and thereby hidden from third-party users), which could in turn be JIT-compiled for the architecture at hand.
I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).
I take a similar, expression-template-based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (which is a dialect of C).
VexCL started as an expression template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA sources: the nvcc compiler is called behind the scenes, and the compiled PTX is stored in an offline cache and loaded on subsequent launches of the program. See the CUDA backend sources for how to do this; compiler.hpp should probably be of most interest for you.
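For a minimal feel of this compile-at-runtime / load-at-runtime pattern without reading the VexCL sources, here is a small sketch in Python using CuPy's RawModule (the kernel source string is compiled with NVRTC on first use and the compiled module is cached, so later runs skip compilation); a C++ expression-template library would generate the source from its expression tree and go through NVRTC or nvcc plus the driver API instead:

import cupy as cp

# CUDA C source assembled at runtime (e.g. generated from an expression tree).
source = r'''
extern "C" __global__
void saxpy(const float a, const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = a * x[i] + y[i];
    }
}
'''

# Compiled at runtime; the result is cached so subsequent runs load it directly.
module = cp.RawModule(code=source)
saxpy = module.get_function('saxpy')

n = 1 << 20
x = cp.random.rand(n, dtype=cp.float32)
y = cp.random.rand(n, dtype=cp.float32)
out = cp.empty_like(x)

threads = 256
blocks = (n + threads - 1) // threads
saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, out, cp.int32(n)))

assert cp.allclose(out, 2.0 * x + y)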

How to recompile Numpy with enabled OpenMP directives

In this answer to Multiprocessing.Pool makes Numpy matrix multiplication slower, the author recommends in the second paragraph recompiling NumPy with OpenMP directives enabled.
So my questions are:
How do you do that?
What could be negative side effects?
Would you recommend that?
Searching SO, I found the post OpenMP and Python, where the answers explain why OpenMP is of little use for plain Python code because of the GIL. But I assume NumPy is a different story.
While Python code itself hardly benefits from running in parallel, NumPy is not written in Python. It is in fact a Pythonic wrapper around some very well-established numerical libraries and other numerical algorithms, implemented in compiled languages like Fortran and C. Some of these libraries already come in parallel, multithreaded versions (for example Intel MKL and ATLAS, when they are used to provide the BLAS and LAPACK implementations in NumPy).
The idea is that in NumPy programs the Python code should only be used to drive the computations, while all the heavy lifting should be done in the C or Fortran backends. If your program doesn't spend most of its run time inside NumPy routines, then Amdahl's law will prevent you from getting a reasonable speed-up with parallel NumPy.
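To put a rough number on that, here is a back-of-the-envelope Amdahl's law calculation (p is the fraction of run time spent inside the parallelizable NumPy calls; the values are illustrative):

def amdahl_speedup(p, n):
    """Upper bound on overall speedup when a fraction p of the run time
    is parallelized across n threads/cores."""
    return 1.0 / ((1.0 - p) + p / n)

# A program that spends only half its time in parallel NumPy routines
# cannot run more than ~1.9x faster overall, even with 16 cores.
print(amdahl_speedup(0.5, 16))   # ~1.88
print(amdahl_speedup(0.95, 16))  # ~9.14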
In order to get NumPy to support OpenMP, you need an OpenMP-enabled C compiler. Most C compilers nowadays support OpenMP, including GCC, the Intel C Compiler, the Oracle C Compiler, and even the Microsoft Visual C compiler (although it is stuck at an ancient OpenMP version). Read the Installation Manual for detailed instructions.
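Before recompiling anything, it may also be worth checking which BLAS/LAPACK implementation your NumPy is already linked against, since a multithreaded backend such as MKL already parallelizes the heavy routines, and its thread count is controlled through environment variables rather than a rebuild. A quick sketch (which variable applies depends on the backend your build uses):

import os

# Thread-count knobs read by the common backends; set them before NumPy is imported.
os.environ.setdefault('OMP_NUM_THREADS', '4')       # OpenMP-based builds
os.environ.setdefault('MKL_NUM_THREADS', '4')       # Intel MKL
os.environ.setdefault('OPENBLAS_NUM_THREADS', '4')  # OpenBLAS

import numpy as np

# Shows which BLAS/LAPACK libraries this NumPy build is linked against.
np.show_config()

# A large matrix product is the kind of call the backend parallelizes.
a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
c = a @ b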

GPU Processing in vb.net

I've got a program that takes about 24 hours to run. It's all written in VB.NET and is about 2000 lines long. It's already multi-threaded, and this works perfectly (after some sweat and tears). I typically run the process with 10 threads, but I'd like to increase that to reduce the processing time, which is where the GPU comes in. I've searched Google for everything related that I can think of, but no luck.
What I'm hoping for is a basic example of a VB.NET project that does some general operations and then sends some threads to the GPU for processing. Ideally I don't want to have to pay for it. So, something like:
' Do some initial processing, e.g.
Dim x As Integer
Dim y As Integer
Dim z As Integer
x = CInt(TextBox1.Text)
y = CInt(TextBox2.Text)
z = x * y
' Do some multi-threaded operations on the GPU, e.g.
' Show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in c++ and other languages but I'm rubbish at understanding other languages!
Thanks all!
Fraser
The VB.NET compiler does not compile onto the GPU; it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code, which is in turn compiled with NVCC to generate code that the GPU can execute. Note that this means you are limited to NVIDIA GPUs, as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version.
CUDAfy Translator - Using SharpDevelop's decompiler ILSpy as a basis, the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others, but these seem to be the main ones. If you decide to use CUDAfy, you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to fit the GPU data-parallel model, unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead depends not only on the size but also on the type of the data and on whether it can be marshaled without conversion. This is in addition to the cost of moving data between the CPU and the GPU.