How to write code which performs calculations using the GPU?

We hear a lot about how certain types of calculation can be completed much more quickly by a GPU than by a CPU, but as a programmer I would have no idea how to force a calculation to be run in this way. Can anyone give a high-level explanation of how this is done?
I am aware that there are libraries which will do this 'by magic' but I would like to understand what they are doing behind the scenes. I'm pretty sure there isn't a runOnGpu flag that you can pass to low-level system calls, so what techniques are available?

You need to rewrite your program. Use C++ AMP if you use C++, or APARAPI if you use Java.
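To make the "rewrite" concrete, here is a minimal C++ AMP sketch of my own (not from the answer above, and Windows/Visual C++ specific, since C++ AMP is a Microsoft extension) that squares every element of a vector on the GPU:

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

// Square each element of 'data' on the GPU (or a fallback accelerator).
void square_on_gpu(std::vector<float>& data) {
    // Wrap the host data; the runtime copies it to the device as needed.
    array_view<float, 1> av(static_cast<int>(data.size()), data);

    // The lambda body is what actually executes on the GPU, once per index.
    parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
        av[idx] = av[idx] * av[idx];
    });

    av.synchronize();   // copy results back to the host vector
}
```

The restrict(amp) qualifier limits the lambda to operations the GPU can execute; that, plus expressing the work as a data-parallel kernel over an index space, is the closest thing there is to a "runOnGpu flag".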

Related

Let the GPU handle recursive algorithm

I have a complex recursive algorithm whose PHP implementation takes about 15 minutes to complete when run from the CLI. I was thinking about porting it to Objective-C and wanted to know how I can make use of the GPU for the calculations. Is there a way to designate threads to be executed by the GPU?
Thanks
Yes, it's possible to use the GPU for calculations, although depending on the task it may not be advantageous. Without posting code, it's anyone's guess what the most efficient approach for your implementation might be. I would recommend reading the "Concurrency Programming Guide"; it's an excellent starting point for understanding the appropriate ways to handle concurrent threading in Objective-C.

Is it possible to optimize a compiled binary?

This is more of a curiosity I suppose, but I was wondering whether it is possible to apply compiler optimizations post-compilation. Are most optimization techniques highly dependent on the IR, or can assembly be translated back and forth fairly easily?
This has been done, though I don't know of many standard tools that do it.
This paper describes an optimizer for Compaq Alpha processors that works after linking has already been done and some of the challenges they faced in writing it.
If you strain the definition a bit, you can use profile-guided optimization to instrument a binary and then rewrite it based on its observable behaviors with regards to cache misses, page faults, etc.
There's also been some work in dynamic translation, in which you run an existing binary in an interpreter and use standard dynamic compilation techniques to try to speed this up. Here's one paper that details this.
Hope this helps!
There's been some recent research interest in this space. Alex Aiken's STOKE project is doing exactly this with some pretty impressive results. In one example, their optimizer found a function that is twice as fast as gcc -O3 for the Montgomery Multiplication step in OpenSSL's RSA library. It applies these optimizations to already-compiled ELF binaries.
Here is a link to the paper.
Some compiler backends have a peephole optimizer which basically does just that: before committing to the assembly that represents the IR, it has a small window of opportunity to optimize.
Basically you would want to do the same thing on the binary, machine code to machine code: not the same tool, but the same kind of process, examining a block of code of some size and optimizing it.
The problem you will run into, though, is that, for example, you may have had some variables that were marked volatile in C, so they are used very inefficiently in the binary; the optimizer won't know the programmer's intent there and could end up optimizing that away.
You could certainly translate the binary back to IR and forward again; there is nothing to stop you from doing that.
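As a toy illustration of the "examine a block and optimize" idea (my own sketch, operating on a textual listing rather than real decoded machine code), here is a peephole pass that drops adjacent push/pop pairs that cancel out:

```cpp
#include <string>
#include <vector>

// Toy peephole pass over a textual x86-style listing: removes adjacent
// "push R" / "pop R" pairs that cancel out, and nothing else.
// A real binary rewriter works on decoded instructions, but the
// sliding-window idea is the same.
std::vector<std::string> peephole(const std::vector<std::string>& asm_lines) {
    std::vector<std::string> out;
    for (size_t i = 0; i < asm_lines.size(); ++i) {
        if (i + 1 < asm_lines.size() &&
            asm_lines[i].rfind("push ", 0) == 0 &&
            asm_lines[i + 1].rfind("pop ", 0) == 0 &&
            asm_lines[i].substr(5) == asm_lines[i + 1].substr(4)) {
            ++i;            // skip both halves of the cancelling pair
            continue;
        }
        out.push_back(asm_lines[i]);
    }
    return out;
}
```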

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, and considering memory, disk, and processor use/access. I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines (see the sketch after this list).
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
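For the "embarrassingly parallel on multiple cores" step, a minimal OpenMP sketch (my own example in C++, compiled with something like g++ -fopenmp; the equivalent directive style exists in Fortran) looks like this:

```cpp
#include <omp.h>
#include <vector>
#include <cstdio>

int main() {
    const int n = 1000000;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double sum = 0.0;

    // Each iteration is independent, so the loop is embarrassingly parallel;
    // OpenMP splits the iterations across the cores of a single machine and
    // combines the partial sums via the reduction clause.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += a[i] * b[i];

    std::printf("dot product = %f\n", sum);
    return 0;
}
```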
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.
Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
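To give a flavor of what MPI code looks like (a minimal sketch of my own, a distributed dot product rather than a full matrix routine):

```cpp
#include <mpi.h>
#include <vector>
#include <cstdio>

// Each rank computes a partial dot product of its slice of the data,
// then MPI_Reduce combines the partial sums on rank 0.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int local_n = 1000;                  // elements owned by this rank
    std::vector<double> x(local_n, 1.0), y(local_n, 2.0);

    double local = 0.0, global = 0.0;
    for (int i = 0; i < local_n; ++i)
        local += x[i] * y[i];

    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        std::printf("dot product over %d ranks = %f\n", size, global);

    MPI_Finalize();
    return 0;
}
```

Build with mpicxx and launch with mpirun; libraries such as ScaLAPACK build block-distributed matrix algorithms on top of exactly these primitives.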
Physics can influence this as well. For example, it's possible to solve some elliptic problems efficiently using explicit dynamics and artificial mass and damping matrices.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.
Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide a sense as it can get). So, look at how the problem is solved now. Could it be solved more quickly with some other method? Can it be divided into independent parts? And so on.
Fortran is the language specialized for scientific computing, and in recent years, along with the development of new language features, there has also been some very interesting development in features aimed at this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest first reading a book like Using OpenMP - OpenMP is a simpler model, but the book (with Fortran examples inside) explains the fundamentals nicely. The Message Passing Interface (MPI, for friends :) is a larger model and one of the most often used; your next step from OpenMP should probably go in this direction. Books on MPI programming are not rare.
You also mentioned libraries - yes, some of those you mentioned are widely used, and others are also available. A person who does not know exactly where the performance problem lies should IMHO never undertake the task of rewriting library routines.
There are also books on parallel algorithms that you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.
We're in a position similar to yours.
I'm mostly in agreement with @Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (it's easy enough to do that) but in pinpointing the code that is wasting time, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. In taking my stack samples, I saw that a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, which in turn was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like whether a matrix is lower-triangular. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written; once it is optimized by the compiler, it will be as fast as it reasonably can be, and it will save all that classifying time.
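For illustration, the "special routine" could be as simple as a fixed-case multiply with no flag arguments to parse on every call (a hypothetical sketch; the name and the column-major layout are assumptions, not the actual code from our project):

```cpp
// Hypothetical special-case replacement for a general "multiply and scale"
// routine: C = alpha * A * B for small dense column-major matrices,
// with no character flags to classify on every call.
void small_gemm_nn(int m, int n, int k, double alpha,
                   const double* A, const double* B, double* C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            double s = 0.0;
            for (int p = 0; p < k; ++p)
                s += A[i + p * m] * B[p + j * k];   // A(i,p) * B(p,j)
            C[i + j * m] = alpha * s;               // C(i,j)
        }
}
```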
For another example, we use a variety of ODE solvers. These are optimized to the nth degree, of course. They work by calling user-provided routines to calculate derivatives and possibly a Jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will mostly find the lower end of the stack in those routines, because they take longer, while the ODE code takes roughly the same time. So optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

How fast is VB.NET compared to native code for arithmetic?

I need to write software that will do a lot of math. Mostly it will be matrix multiplication with integers to compute a DCT. How much faster should I expect the code to run in native C as compared to VB.NET? A factor of 2, a factor of 10, a factor of 1000...? Has anyone tried this and collected statistics?
.NET code is JIT-compiled to native code before execution, so it should not be slower than native code in general. I'd expect a factor < 10.
Moreover, adaptive optimization techniques profile the code as it runs, gaining more information than a typical static compiler. So the JIT can make more informed decisions for further optimizations.
The .NET code is compiled into native code by the JIT compiler, so you get native code in both cases.
The difference is that the C code has somewhat less overhead around the calculations, so you should perhaps expect a performance difference of around a factor of 2.
VB is 93.7% as fast as C. If you pick the right scenario.
Actually, if your 'native C' includes regular calls to malloc() and free(), any kind of garbage-collected language like VB.NET is going to run circles around it. GC can be 10x faster than mallocs in your inner loops.
If you break down and use C, try to reuse structures that you declared just once instead of making new ones, to avoid this problem. This may be of benefit even in VB if your solution lends itself to it. However it will be harder to program and GC is very fast.
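A small sketch of that reuse pattern (hypothetical names; block_dct stands in for whatever transform you run per block):

```cpp
#include <vector>

// "Declare once, reuse": keep a scratch buffer alive across iterations
// instead of allocating a fresh block inside the hot loop.
void process_blocks(const std::vector<int>& blocks, int block_count) {
    std::vector<int> scratch(64);                  // allocated once, up front
    for (int b = 0; b < block_count; ++b) {
        for (int i = 0; i < 64; ++i)               // refill the same buffer
            scratch[i] = blocks[b * 64 + i];
        // block_dct(scratch.data());              // hypothetical 8x8 DCT call
        // ... write results out; no per-iteration new/delete or GC pressure
    }
}
```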
As far as bounds/overflow checks, if speed is important and testing has revealed they don't happen, and you're not risking life or millions from an error or abend, they are a waste of time. But if you can't get rid of them, your time is likely still more valuable in a language with which you can program more quickly.
If you expect serious size and usage, it pays to split the task with a controlling program and store the allocated 'task definitions' in a shared directory, with a file per task solver, or in a database. Then you can run one solver per processor (2 per hyper-threaded CPU), or on networked computers. Be wary of queue structures - it's tough to atomically 'Mark-Taken-And-Get-Data-If-Not-Taken'. You know how many task solvers you're going to start. I did this with an imaging utility I develop; it was much easier than expected, and it creamed the previous version. Plus, if you use multiple processes with a properly dividable problem domain, you avoid the slight-to-significant programming burden of multithreading. Or of convincing your coworkers that your curly braces are in the right place. Peace.
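The 'Mark-Taken-And-Get-Data-If-Not-Taken' step is the tricky part of a file-per-task scheme. One way to sketch it (assuming a POSIX shared filesystem that honors O_EXCL, which NFS does not always guarantee) is to claim a task by exclusively creating a lock file next to it:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Try to claim a task by atomically creating a ".lock" marker next to it.
// O_CREAT | O_EXCL guarantees that only one process can create the file,
// so whoever succeeds owns the task; everyone else moves on.
bool try_claim(const std::string& task_path) {
    std::string lock_path = task_path + ".lock";
    int fd = open(lock_path.c_str(), O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd < 0) return false;   // someone else already claimed it
    close(fd);
    return true;                // caller now owns task_path
}
```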

How to optimize MATLAB loops?

I have been working lately on a number of iterative algorithms in MATLAB, and been getting hit hard by MATLAB's performance (or lack thereof) when it comes to loops. I'm aware of the benefit of vectorizing code when possible, but are there any tools for optimization when you need the loop for your algorithm?
I am aware of the MEX-file option to write small subroutines in C/C++, although given my algorithms, this can be a very painful option given the data structures required. I mainly use MATLAB for the simplicity and speed of prototyping, so a syntactically complex, statically typed language is not ideal for my situation.
Are there any other suggestions? Even other languages (python?) which have relatively painless matrix tools are an option.
It was once true that vectorization would improve the speed of your MATLAB code. However, that is largely no longer true with the JIT accelerator.
This video demonstrating the MATLAB profiler might help.
The PROFILER is a very useful tool for finding bottlenecks in MATLAB code. It does not change your code, of course, but it helps you find which functions/lines to optimize with vectorization or MEX.
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/profile.html
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f9-17018.html
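If the profiler does point at a loop that has to stay a loop, the MEX route mentioned in the question is less painful than it sounds. Here is a minimal gateway sketch (my own example, a hypothetical function that just doubles a real double matrix; build it from within MATLAB with the mex command):

```cpp
#include "mex.h"

// Minimal MEX gateway: y = times_two(x) for a real double matrix.
void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, const mxArray *prhs[]) {
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("Expected one real double matrix.");

    mwSize m = mxGetM(prhs[0]);
    mwSize n = mxGetN(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);

    const double *in = mxGetPr(prhs[0]);
    double *out = mxGetPr(plhs[0]);
    for (mwSize i = 0; i < m * n; ++i)
        out[i] = 2.0 * in[i];   // the loop now runs at C speed
}
```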
If you have a choice, be sure to set up your loops so you scan the data column-wise, which is how data in MATLAB are arranged. In addition, be sure to preallocate any output arrays before the loop and index into them instead of growing the array inside the for-loop.
If you can cast your code so your operations are called on the whole matrix then you will see great improvement in the speed of your code. Many functions are much quicker when operating on the whole matrix rather than in an element-wise fashion with loops.
You might want to investigate MATLAB's Parallel Computing Toolbox, which can make a big difference if you have the right hardware. I rewrote about 12 lines of code and got a 4-6x speedup for one of our loop-intensive programs on an eight-core PC.