GPU Processing in vb.net - vb.net

I've got a program that takes about 24 hours to run. It's all written in VB.net and it's about 2000 lines long. It's already multi-threaded and this works perfectly (after some sweat and tears). I typically run the processes with 10 threads but I'd like to increase that to reduce processing time, which is where using the GPU comes into it. I've search google for everything related that I can think of to find some info but no luck.
What I'm hoping for is a basic example of a vb.net project that does some general operations then sends some threads to the GPU for processing. Ideally I don't want to have to pay for it. So something like:
'Do some initial processing eg.
dim x as integer
dim y as integer
dim z as integer
x=int(textbox1.text)
y=int(textbox2.text)
z=x*y
'Do some multi-threaded operations on the gpu eg.
'show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in c++ and other languages but I'm rubbish at understanding other languages!
Thanks all!
Fraser

The VB.NET compiler does not compile onto the GPU, it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code. In turn this is compiled with NVCC to generate code that the GPU can execute. Note that this means that you are limited to NVidia GPUs as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version. This does actually
CUDAfy Translator
Using SharpDevelop's decompiler ILSpy as basis the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others but these seem to be the main ones. If you decide to use CUDAfy then you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to fit the GPU data-parallel model. Unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead here largely depends on not only the size but also the type of data and if it can be marshaled without conversion. This is in addition to the cost of moving data from the CPU to the GPU etc.

Related

Simple explanation of terms while running configure command during Tensorflow installation

I am installing tensorflow from this link.
When I run the ./configurecommand, I see following terms
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that: 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures." As far as I understand will this take the computation you define in tensorflow and compile it. Think of producing code in C and then running it through the C compiler for the CPU and load the resulting shared library with the code for the full computation instead of making separate calls from python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just in time compiler', i.e. the graph is compiled 'on the fly'.
GDR seems to be support for exchanging data between GPUs on different servers via GPU direct. GPU direct makes it possible that e.g. the network card which receives data from another server over the network writes directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the Infiniband Verbs application programming interface ('library'). Infiniband is a low latency network used in many supercomputers for example. This can be used when you want to run tensorflow on more than one server for communication between them. The Verbs API is to Infiniband what the Berkeley Socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language suited for executing parallel computing tasks on CPU and non-CPU devices such as GPUs, with a C like syntax. With respect to C however there are certain restrictions such as no support for recursion etc. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although also OpenCL is also used by other companies like ALTERA).

How can I speed up deep learning on a non-NVIDIA setup?

Since I only have an AMD A10-7850 APU, and do not have the funds to spend on a $800-$1200 NVIDIA graphics card, I am trying to make due with the resources I have in order to speed up deep learning via tensorflow/keras.
Initially, I used a pre-compiled version of Tensorflow. InceptionV3 would take about 1000-1200 seconds to compute 1 epoch. It has been painfully slow.
To speed up calculations, I first self-compiled Tensorflow with optimizers (using AVX, and SSE4 instructions). This lead to a roughly 40% decrease in computation times. The same computations performed above now only take about 600 seconds to compute. It's almost bearable - kind of like you can watch paint dry.
I am looking for ways to further decrease computation times. I only have an integrated AMD graphics card that is part of the APU. (How) (C/c)an I make use of this resource to speed up computation even more?
More generally, let's say there are other people with similar monetary restrictions and Intel setups. How can anyone WITHOUT discrete NVIDIA cards make use of their integrated graphics chips or otherwise non-NVIDIA setup to achieve faster than CPU-only performance? Is that possible? Why/Why not? What needs to be done to achieve this goal? Or will this be possible in the near future (2-6 months)? How?
After researching this topic for a few months, I can see 3.5 possible paths forward:
1.) Tensorflow + OpenCl as mentioned in the comments above:
There seems to be some movement going on this field. Over at Codeplay, Lukasz Iwanski just posted a comprehensive answer on how to get tensorflow to run with opencl here (I will only provide a link as stated above because the information might change there): https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
The potential to use integrated graphics is alluring. It's also worth exploring the use of this combination with APUs. But I am not sure how well this will work since OpenCl support is still early in development, and hardware support is very limited. Furthermore, OpenCl is not the same as a handcrafted library of optimized code. (UPDATE 2017-04-24: I have gotten the code to compile after running into some issues here!) Unfortunately, the hoped for speed improvements ON MY SETUP (iGPU) did not materialize.
CIFAR 10 Dataset:
Tensorflow (via pip ak unoptimized): 1700sec/epoch at 390% CPU
utilization.
Tensorflow (SSE4, AVX): 1100sec/epoch at 390% CPU
utilization.
Tensorflow (opencl + iGPU): 5800sec/epoch at 150% CPU
and 100% GPU utilization.
Your mileage may vary significantly. So I am wondering what are other people getting relatively speaking (unoptimized vs optimized vs opencl) on your setups?
What should be noted: opencl implementation means that all the heavy computation should be done on the GPU. (Updated on 2017/4/29) But in reality this is not the case yet because some functions have not been implemented yet. This leads to unnecessary copying back and forth of data between CPU and GPU ram. Again, imminent changes should improve the situation. And furthermore, for those interested in helping out and those wanting to speed things up, we can do something that will have a measurable impact on the performance of tensorflow with opencl.
But as it stands for now: 1 iGPU << 4 CPUS with SSE+AVX. Perhaps beefier GPUs with larger RAM and/or opencl 2.0 implementation could have made a larger difference.
At this point, I should add that similar efforts have been going on with at least Caffe and/or Theano + OpenCl. The limiting step in all cases appears to be the manual porting of CUDA/cuDNN functionality to the openCl paradigm.
2.) RocM + MIOpen
RocM stands for Radeon Open Compute and seems to be a hodgepodge of initiatives that is/will make deep-learning possible on non-NVIDIA (mostly Radeon devices). It includes 3 major components:
HIP: A tool that converts CUDA code to code that can be consumed by AMD GPUs.
ROCk: a 64-bit linux kernel driver for AMD CPU+GPU devices.
HCC: A C/C++ compiler that compiles code into code for a heterogeneous system architecture environment (HSA).
Apparently, RocM is designed to play to AMDs strenghts of having both CPU and GPU technology. Their approach to speeding up deep-learning make use of both components. As an APU owner, I am particularly interested in this possibility. But as a cautionary note: Kaveri APUs have limited support (only integrated graphcs is supported). Future APUs have not been released yet. And it appears, there is still a lot of work that is being done here to bring this project to a mature state. A lot of work will hopefully make this approach viable within a year given that AMD has announced their Radeon Instinct cards will be released this year (2017).
The problem here for me is that that RocM is providing tools for building deep learning libraries. They do not themselves represent deep learning libraries. As a data scientist who is not focused on tools development, I just want something that works. and am not necessarily interested in building what I want to then do the learning. There are not enough hours in the day to do both well at the company I am at.
NVIDIA has of course CUDA and cuDNN which are libaries of hand-crafted assembler code optimized for NVIDIA GPUs. All major deep learning frameworks build on top of these proprietary libraries. AMD currently does not have anything like that at all.
I am uncertain how successfully AMD will get to where NVIDIA is in this regard. But there is some light being shone on what AMDs intentions are in an article posted by Carlos Perez on 4/3/2017 here. A recent lecture at Stanford also talks in general terms about Ryzen, Vega and deep learning fit together. In essence, the article states that MIOpen will represent this hand-crafted library of optimized deep learning functions for AMD devices. This library is set to be released in H1 of 2017. I am uncertain how soon these libraries would be incorporated into the major deep learning frameworks and what the scope of functional implementation will be then at this time.
But apparently, AMD has already worked with the developers of Caffe to "hippify" the code basis. Basically, CUDA code is converted automatically to C code via HIP. The automation takes care of the vast majority of the code basis, leaving only less than 0.5% of code had to be changed and required manual attention. Compare that to the manual translation into openCl code, and one starts getting the feeling that this approach might be more sustainable. What I am not clear about is where the lower-level assembler language optimization come in.
(Update 2017-05-19) But with the imminent release of AMD Vega cards (the professional Frontier Edition card not for consumers will be first), there are hints that major deep learning frameworks will be supported through the MIOpen framework. A Forbes article released today shows the progress MiOpen has taken over just the last couple of months in terms of performance: it appears significant.
(Update 2017-08-25) MiOpen has officially been released. We are no longer talking in hypotheticals here. Now we just need to try out how well this framework works.
3.) Neon
Neon is Nervana's (now acquired by Intel) open-source deep-learning framework. The reason I mention this framework is that it seems to be fairly straightforward to use. The syntax is about as easy and intuitive as Keras. More importantly though, this framework has achieved speeds up to 2x faster than Tensorflow on some benchmarks due to some hand-crafted assembler language optimization for those computations. Potentially, cutting computation times from 500 secs/epoch down to 300 secs/epoch is nothing to sneeze at. 300 secs = 5 minutes. So one could get 15 epochs in in an hour. and about 50 epochs in about 3.5 hours! But ideally, I want to do these kinds of calculations in under an hour. To get to those levels, I need to use a GPU, and at this point, only NVIDIA offers full support in this regard: Neon also uses CUDA and cuDNN when a GPU is available (and of course, it has to be an NVIDIA GPU). If you have access other Intel hardware this is of course a valid way to pursue. Afterall, Neon was developed out of a motivation to get things to work optimally also on non-NVIDIA setups (like Nervana's custom CPUs, and now Intel FPGAs or Xeon Phis).
3.5.) Intel Movidius
Update 2017-08-25: I came across this article. Intel has released a USB3.0-stick-based "deep learning" accelerator. Apparently, it works with Cafe and allows the user perform common Cafe-based fine-tuning of networks and inference. This is important stressing: If you want to train your own network from scratch, the wording is very ambiguous here. I will therefore assume, that apart from fine-tuning a network, training itself should still be done on something with more parallel compute. The real kicker though is this: When I checked for the pricing this stick costs a mere $79. That's nothing compared to the cost of your average NVIDIA 1070-80(ti) card. If you merely want to tackle some vision problems using common network topologies already available for some related tasks, you can use this stick to fine tune it to your own use, then compile the code and put it into this stick to do inference quickly. Many use cases can be covered with this stick, and for again $79 it could be worth it. This being Intel, they are proposing to go all out on Intel. Their model is to use the cloud (i.e. Nervana Cloud) for training. Then, use this chip for prototype inference or inference where energy consumption matters. Whether this is the right approach or not is left for the reader to answer.
At this time, it looks like deep learning without NVIDIA is still difficult to realize. Some limited speed gains are difficult but potentially possible through the use of opencl. Other initiatives sound promising but it will take time to sort out the real impact that these initiatives will have.
If your platform supports opencl you can look at using it with tensorflow. There is some experimental support for it on Linux at this github repository. Some preliminary instructions are at the documentation section of of this github repository.

CUDA Expression Templates and Just in Time Compilation (JIT)

I have some questions about Just-In-Time (JIT) compilation with CUDA.
I have implemented a library based on Expression Templates according to the paper
J.M. Cohen, "Processing Device Arrays with C++ Metaprogramming", GPU Computing Gems - Jade Edition
It seems to work fairly good. If I compare the computing time of the matrix elementwise operation
D_D=A_D*B_D-sin(C_D)+3.;
with that of a purposely developed CUDA kernel, I have the following results (in parentheses, the matrix size):
time [ms] hand-written kernel: 2.05 (1024x1024) 8.16 (2048x2048) 57.4 (4096*4096)
time [ms] LIBRARY: 2.07 (1024x1024) 8.17 (2048x2048) 57.4 (4096*4096)
The library seems to need approximately the same computing time of the hand-written kernel. I'm also using the C++11 keyword auto to evaluate expressions only when they are actually needed, according to Expression templates: improving performance in evaluating expressions?. My first question is
1. Which kind of further benefit (in terms of code optimization) would JIT provide to the library? Would JIT introduce any further burdening due to runtime compilation?
It is known that a library based on Expression Templates cannot be put inside a .dll library, see for example http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/00edbe1d-4906-4d91-b710-825b503787e2. My second question is:
2. Would JIT help in hiding the implementation to a third-party user? If yes, how?
The CUDA SDK include the ptxjit example in which the ptx code is not loaded at runtime, but defined at compile time. My third question is:
3. How should I implement JIT in my case? Are there examples of JIT using PTX loaded at runtime?
Thank you very much for any help.
EDIT following Talonmies' comment
From the Cuda kernel just-in-time (jit) compilation possible? post, it reads that
cuda code can be compiled to an intermediate format ptx code, which will then be jit-compiled to the actual device architecture machine code at runtime
A doubt I have is whether the above can be applied to an Expression Templates library. I know that, due to instantiation problems, a CUDA/C++ template code cannot be compiled to a PTX. But perhaps if I instantiate all the possible combinations of Type/Operators for Unary and Binary Expressions, at least a part of the implementation can be compiled (and then masked to third-party users) to PTX which can be in turn JIT compiled to the architecture at hand.
I think you should look into OpenCL. It provides a JIT-like programming model for creating, compiling, and executing compute kernels on GPUs (all at run-time).
I take a similar, expression-template based approach in Boost.Compute which allows the library to support C++ templates and generic algorithms by translating compile-type C++ expressions into OpenCL kernel code (which is a dialect of C).
VexCL started as an expression template library for OpenCL, but since v1.0 it also supports CUDA. What it does for CUDA is exactly JIT-compilation of CUDA sources. nvcc compiler is called behind the scenes, the compiled PTX is stored in an offline cache and is loaded on subsequent launches of the program. See CUDA backend sources for how to do this. compiler.hpp should probably be of most interest for you.

How does machine code communicate with processor?

Let's take Python as an example. If I am not mistaken, when you program in it, the computer first "translates" the code to C. Then again, from C to assembly. Assembly is written in machine code. (This is just a vague idea that I have about this so correct me if I am wrong) But what's machine code written in, or, more exactly, how does the processor process its instructions, how does it "find out" what to do?
If I am not mistaken, when you program in it, the computer first "translates" the code to C.
No it doesn't. C is nothing special except that it's the most widespread programming language used for system programming.
The Python interpreter translates the Python code into so called P-Code that's executed by a virtual machine. This virtual machine is the actual interpreter which reads P-Code and every blip of P-Code makes the interpreter execute a predefined codepath. This is not very unlike how native binary machine code controls a CPU. A more modern approach is to translate the P-Code into native machine code.
The CPython interpreter itself is written in C and has been compiled into a native binary. Basically a native binary is just a long series of numbers (opcodes) where each number designates a certain operation. Some opcodes tell the machine that a defined count of numbers following it are not opcodes but parameters.
The CPU itself contains a so called instruction decoder, which reads the native binary number by number and for each opcode it reads it gives power to the circuit of the CPU that implement this particular opcode. there are opcodes, that address memory, opcodes that load data from memory into registers and so on.
how does the processor process its instructions, how does it "find out" what to do?
For every opcode, which is just a binary pattern, there is its own circuit on the CPU. If the pattern of the opcode matches the "switch" that enables this opcode, its circuit processes it.
Here's a WikiBook about it:
http://en.wikibooks.org/wiki/Microprocessor_Design
A few years ago some guy built a whole, working computer from simple function logic and memory ICs, i.e. no microcontroller or similar involved. The whole project called "Big Mess o' Wires" was more or less a CPU built from scratch. The only thing nerdier would have been building that thing from single transistors (which actually wasn't that much more difficult). He also provides a simulator which allows you to see how the CPU works internally, decoding each instruction and executing it: Big Mess o' Wires Simulator
EDIT: Ever since I originally wrote that answer, building a fully fledged CPU from modern, discrete components has been done: For your considereation a MOS6502 (the CPU that powered the Apple II, Commodore C64, NES, BBC Micro and many more) built from discetes: https://monster6502.com/
Machine-code does not "communicate with the processor".
Rather, the processor "knows how to evaluate" machine-code. In the [widespread] Von Neumann architecture this machine-code (program) can be thought of as an index-able array of where each cell contains a machine-code instruction (or data, but let's ignore that for now).
The CPU "looks" at the current instruction (often identified by the PC or Program Counter) and decides what to do (this can either be done directly with transistors/"bare-metal", or it be translated to even lower-level code): this is known as the fetch-decode-execute cycle.
When the instructions are executed side-effects occur such as setting a control flag, putting a value in a register, or jumping to a different index (changing the PC) in the program, etc. See this simple overview of a CPU which covers the above a little bit better.
It is the evaluation of each instruction -- as it is encountered -- and the interaction of side-effects that results in the operation of a traditional processor.
(Of course, modern CPUs are very complex and do lots of neat tricky things!)
That's called microcode. It's the code in the CPU that reads machine code instructions and translate that into low level data flow.
When the CPU for example encounters the add instruction, the microcode describes how it should get the two values, feed them to the ALU to do the calculation, and where to put the result.
Electricity. Circuits, memory, and logic gates.
Also, I believe Python is usually interpreted, not compiled through C → assembly → machine code.

How Do You Profile & Optimize CUDA Kernels?

I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss as to figure out both what the right questions are and what tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux then the CUDA Visual Profiler gives you a whole load of information, knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w). Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant serialization, the presentation goes into more detail and what to do about this as does the SDK (e.g. the reduction sample)
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using cudaEvents), if you have a large amount of data transfer you want to overlap the compute and the I/O
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting the compute to measure expected vs. measured bandwidth is really useful (and vice versa for compute throughput)
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback ?
The nVidia CUDA developer forum forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Running it on a CUDA processor, and doing the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.