CUDA Expression Templates and Just-in-Time Compilation (JIT) - dll

I have some questions about Just-In-Time (JIT) compilation with CUDA.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition
It seems to work fairly well. If I compare the computing time of the matrix element-wise operation
D_D=A_D*B_D-sin(C_D)+3.;
with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):
time [ms] hand-written kernel: 2.05 (1024x1024) 8.16 (2048x2048) 57.4 (4096x4096)
time [ms] LIBRARY: 2.07 (1024x1024) 8.17 (2048x2048) 57.4 (4096x4096)
The library seems to need approximately the same computing time as the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, following Expression templates: improving performance in evaluating expressions?. My first question is:
1. What kind of further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any additional overhead due to runtime compilation?
It is known that a library based on Expression Templates cannot be put inside a .dll; see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation from a third-party user? If yes, how?
The CUDA SDK includes the ptxjit example, in which the PTX code is not loaded at runtime but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.
EDIT following Talonmies' comment
In the Cuda kernel just-in-time (jit) compilation possible? post, it says that
cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime
A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, generic CUDA/C++ template code cannot be compiled to PTX. But perhaps, if I instantiate all the possible combinations of Types/Operators for Unary and Binary Expressions, at least part of the implementation could be compiled to PTX (and so hidden from third-party users), which could in turn be JIT-compiled for the architecture at hand.
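To make the idea concrete, here is a minimal illustrative sketch (not my actual library) of how such expression-template nodes and the templated evaluation kernel might look in CUDA C++; each concrete instantiation of eval<Expr> is one of the Type/Operator combinations that would have to be pre-instantiated to produce PTX:

// Minimal CUDA expression-template sketch (illustrative only, not the
// library from the question). Operator overloads build lightweight node
// objects, and a single kernel evaluates the whole tree element-wise,
// so an expression like A*B - C fuses into one pass over the data.
// The same pattern extends to unary nodes (e.g. sin) and scalar constants.

struct DevArray {                       // non-owning view of device data
    const float* data;
    __device__ float operator[](int i) const { return data[i]; }
};

struct Mul { __device__ static float apply(float a, float b) { return a * b; } };
struct Sub { __device__ static float apply(float a, float b) { return a - b; } };

template <class L, class R, class Op>
struct BinExpr {
    L l; R r;
    __device__ float operator[](int i) const { return Op::apply(l[i], r[i]); }
};

template <class L, class R>
BinExpr<L, R, Mul> operator*(L l, R r) { return BinExpr<L, R, Mul>{l, r}; }

template <class L, class R>
BinExpr<L, R, Sub> operator-(L l, R r) { return BinExpr<L, R, Sub>{l, r}; }

template <class Expr>
__global__ void eval(float* out, Expr e, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = e[i];           // one read/compute/write per element
}

// usage (a, b, c, d are float* device pointers assumed already allocated):
//   eval<<<(n + 255) / 256, 256>>>(d, DevArray{a} * DevArray{b} - DevArray{c}, n);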

I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).
I take a similar, expression-template-based approach in Boost.Compute, which allows the library to support C++ templates and generic algorithms by translating compile-time C++ expressions into OpenCL kernel code (which is a dialect of C).
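For illustration, usage looks roughly like the following sketch (written from memory of the Boost.Compute lambda interface; check the library documentation for the exact headers): the host-side C++ expression _1 * 3.0f + 1.0f is turned into OpenCL C and JIT-compiled by the OpenCL runtime.

#include <vector>
#include <boost/compute/core.hpp>
#include <boost/compute/algorithm/copy.hpp>
#include <boost/compute/algorithm/transform.hpp>
#include <boost/compute/container/vector.hpp>
#include <boost/compute/lambda.hpp>

namespace compute = boost::compute;

int main()
{
    compute::device device = compute::system::default_device();
    compute::context context(device);
    compute::command_queue queue(context, device);

    std::vector<float> host(1024, 2.0f);
    compute::vector<float> d_vec(host.size(), context);
    compute::copy(host.begin(), host.end(), d_vec.begin(), queue);

    // the expression below is translated into an OpenCL kernel at runtime
    using compute::lambda::_1;
    compute::transform(d_vec.begin(), d_vec.end(), d_vec.begin(),
                       _1 * 3.0f + 1.0f, queue);

    compute::copy(d_vec.begin(), d_vec.end(), host.begin(), queue);
    return 0;
}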

VexCL started as an expression-template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT compilation of CUDA sources: the nvcc compiler is called behind the scenes, the compiled PTX is stored in an offline cache, and it is loaded on subsequent launches of the program. See the CUDA backend sources for how to do this; compiler.hpp should probably be of most interest to you.
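As for loading PTX at runtime (question 3), the CUDA Driver API sequence is roughly the following minimal sketch (most error checking omitted; "kernel.ptx" and "my_kernel" are placeholder names, and the kernel should be declared extern "C" when compiled to PTX so its name is not mangled):

#include <cuda.h>          // CUDA Driver API, link with -lcuda
#include <fstream>
#include <iterator>
#include <string>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // read the PTX produced offline, e.g. with "nvcc -ptx kernel.cu"
    std::ifstream in("kernel.ptx");
    std::string ptx((std::istreambuf_iterator<char>(in)),
                    std::istreambuf_iterator<char>());

    // cuModuleLoadData JIT-compiles the PTX for the current device
    CUmodule module;
    cuModuleLoadData(&module, ptx.c_str());

    CUfunction kernel;
    cuModuleGetFunction(&kernel, module, "my_kernel");

    int n = 1024 * 1024;
    CUdeviceptr d_x;
    cuMemAlloc(&d_x, n * sizeof(float));

    void* args[] = { &d_x, &n };
    cuLaunchKernel(kernel,
                   (n + 255) / 256, 1, 1,   // grid
                   256, 1, 1,               // block
                   0, 0, args, 0);          // shared mem, stream, params, extra
    cuCtxSynchronize();

    cuMemFree(d_x);
    cuModuleUnload(module);
    cuCtxDestroy(ctx);
    return 0;
}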

Related

Can GraalVM combine ahead-of-time compilation with adaptive optimization?

As far as I know, a JVM can work in different ways:
Interpreter: runtime translation from bytecode to native code, over and over again.
Just-in-time compilation: compile parts of the bytecode into native code at runtime when needed, and keep the compiled results. There is a performance overhead/penalty for compilation, but it opens up possibilities for adaptive optimization at runtime that are not possible with static ahead-of-time compilation.
Hotspot: only frequently executed parts get JIT-compiled; the rest is interpreted.
Now GraalVM can do ahead-of-time compilation of bytecode into native code.
Is it possible to compile bytecode ahead of time and do adaptive optimization on hotspots (in general, and with GraalVM in particular)?
[Clarification]
I don't mean to AOT-compile parts of the bytecode to native code and leave other parts as bytecode in order to perform hotspot JIT compilation of them at runtime. This is what the Excelsior JET Java implementation seems to do, from what I have read so far.
I mean AOT-compiling the whole bytecode and replacing the hotspot parts at runtime with adaptively optimized recompilations, which requires properly connecting/inserting the optimized code into the existing AOT-compiled code.
[/clarification]
I don't know what information is needed to recompile hotspots with adaptive optimization at runtime. Is the bytecode needed for it? That would mean higher memory consumption as the cost of higher performance.
I am not an expert on this, so please tell me if any assumptions are wrong.
Refer to JEP 295. It mentions different AOT modes, including tiered AOT, which delivers C1-compiled code with profiling instrumentation that can then be optimized with C2 at runtime.

Simple explanation of terms while running configure command during Tensorflow installation

I am installing TensorFlow from this link.
When I run the ./configure command, I see the following terms:
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures.' As far as I understand, this will take the computation you define in TensorFlow and compile it. Think of producing code in C, running it through the C compiler for the CPU, and loading the resulting shared library with the code for the full computation, instead of making separate calls from Python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just-in-time compiler', i.e. the graph is compiled 'on the fly'.
GDR seems to be support for exchanging data between GPUs on different servers via GPUDirect. GPUDirect makes it possible for, e.g., the network card that receives data from another server to write directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the InfiniBand Verbs application programming interface ('library'). InfiniBand is a low-latency network used in many supercomputers, for example. It can be used for communication between servers when you want to run TensorFlow on more than one of them. The Verbs API is to InfiniBand what the Berkeley Socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language with a C-like syntax suited for executing parallel computing tasks on CPUs and non-CPU devices such as GPUs. Compared to C, however, there are certain restrictions, such as no support for recursion. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although OpenCL is also used by other companies, like Altera).

Is it possible to customize the JVM's C1 or C2 JIT compiler, or the profiler used by the JIT compiler?

Is customizing the JIT a viable alternative for some of the JNI use cases?
Edit:
For the purpose of improving performance via custom hardware such as a GPU or FPGA. I am aware of Project Sumatra, but I believe it is GPU-focused and no longer active. Optimization via JNI requires explicit specification in the source, while things like intrinsics in the JIT (maybe a more custom and complex version of intrinsics) could make optimization a bit more automatic?

Developing PTX instead of CUDA for optimization. Does it make sense? [closed]

I'm developing CUDA code. But new device languages with PTX or SPIR backends have been announced, and I have come across some applications being developed with them. At least, I think we can say the PTX language is enough to develop something at product level.
As we know, PTX is not real device code; it is just an intermediate language for NVIDIA. But my question is: what if I develop in PTX instead of CUDA? Can I write naturally optimized code if I use PTX? Does it make sense?
On the other hand, why/what is the motivation for the PTX language?
Thanks in advance
Yes, it can make sense to implement CUDA code in PTX, just as it can make sense to implement regular CPU code in assembly instead of C++.
For instance, in CUDA C, there is no efficient way of capturing the carry flag and including it in new calculations. So it can be hard to implement efficient math operations that use more bits than what is supported natively by the machine (which is 32 bits on all current GPUs). With PTX, you can efficiently implement such operations.
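For example, a 64-bit addition built from 32-bit halves can use the carry flag directly through inline PTX (an illustrative sketch, not code from any particular project):

// add.cc sets the carry flag, addc consumes it
__device__ unsigned long long add64(unsigned int a_lo, unsigned int a_hi,
                                    unsigned int b_lo, unsigned int b_hi)
{
    unsigned int lo, hi;
    asm("add.cc.u32 %0, %2, %3;\n\t"
        "addc.u32   %1, %4, %5;\n\t"
        : "=r"(lo), "=r"(hi)
        : "r"(a_lo), "r"(b_lo), "r"(a_hi), "r"(b_hi));
    return ((unsigned long long)hi << 32) | lo;
}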
I implemented a project in both CUDA C and PTX, and saw significant speedup in PTX. Of course, you will only see a speedup if your PTX code is better than the code created by the compiler from plain CUDA C.
I would recommend first creating a CUDA C version for reference. Then create a copy of the reference and start replacing parts of it with PTX, as determined by results from profiling, while making sure the results match that of the reference.
As far as the motivation for PTX, it provides an abstraction that lets NVIDIA change the native machine language between generations of GPUs without breaking backwards compatibility.
The main advantage of developing in PTX is that it can give you access to certain features which are not exposed directly in CUDA C. For instance, certain cache modifiers on load instructions, some packed SIMD operations, and predicates.
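For instance, a global load with the .cg cache modifier (cache in L2, bypass L1) can be wrapped in a small device function via inline PTX (an illustrative sketch, assuming a 64-bit address space):

__device__ float load_cg(const float* p)
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}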
That said, I wouldn't advise anyone to code in PTX. On the CUDA Library team, we sometimes wrap PTX routines in a C function via inline assembly, and then use that. But programming in C/C++/Fortran is way easier than writing PTX.
Also, the runtime will recompile your PTX into an internal hardware-specific assembly language. In the process it may reorder instructions, assign registers, and change scheduling, so all of your careful ordering in PTX is mostly unnecessary and usually has little to do with the final assembly code. NVIDIA now ships a disassembler which lets you view the actual internal assembly - you can compare for yourself if you want to play around with it.

GPU Processing in vb.net

I've got a program that takes about 24 hours to run. It's all written in VB.NET and it's about 2000 lines long. It's already multi-threaded and this works perfectly (after some sweat and tears). I typically run the processes with 10 threads, but I'd like to increase that to reduce processing time, which is where using the GPU comes into it. I've searched Google for everything related that I can think of to find some info, but no luck.
What I'm hoping for is a basic example of a VB.NET project that does some general operations, then sends some threads to the GPU for processing. Ideally I don't want to have to pay for it. So something like:
' Do some initial processing, e.g.
Dim x As Integer
Dim y As Integer
Dim z As Integer
x = CInt(TextBox1.Text)
y = CInt(TextBox2.Text)
z = x * y
' Do some multi-threaded operations on the GPU, e.g.
' show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in C++ and other languages, but I'm rubbish at understanding other languages!
Thanks all!
Fraser
The VB.NET compiler does not compile onto the GPU; it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code. In turn, this is compiled with NVCC to generate code that the GPU can execute. Note that this means you are limited to NVIDIA GPUs, as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version.
CUDAfy Translator - Using SharpDevelop's decompiler ILSpy as a basis, the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others, but these seem to be the main ones. If you decide to use CUDAfy, you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to fit the GPU data-parallel model, unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead here depends largely not only on the size but also on the type of data, and on whether it can be marshaled without conversion. This is in addition to the cost of moving data from the CPU to the GPU, etc.