I understand that XLA performs automatic kernel fusion for a computational graph, which comes in handy for reducing memory bandwidth usage on a GPU. What gains can one expect from using XLA on a CPU? Is it the same principle, fusing computations and not writing intermediate results out to memory? I would appreciate a layman's explanation.
Yes, basically it's what you said.
In general, the more information (or "context") you, as a compiler, have about a set of computations, the better you can optimize them.
As pointed out on the XLA page, the single most important feature of XLA is fusion.
Instead of computing x + y*z as two separate operations, it can be computed as a single fused multiply-add operation.
This is not only (generally) faster, it also avoids materializing intermediate results, which may have lower precision and would need to be stored somewhere.
The plain TensorFlow model probably works by taking a set of data from memory, running one of a predefined set of kernels on it, and storing each partial result back in memory so that the next kernel can consume it.
With XLA, linear algebra patterns are recognized and further optimized by combining one or more kernels together, avoiding an unnecessary back and forth from memory.
Modern mainstream CPUs have support for "vectors" (in jargon: SIMD), and some support linear algebra operations the way GPUs do.
So yes, it's the same principle (though GPUs can do far more linear algebra operations in parallel, so the gain is bigger there).
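As a concrete illustration, here is a minimal sketch (assuming TensorFlow 2.x with XLA available; the function name fma is mine) of asking XLA to JIT-compile a small function, which lets it fuse the multiply and the add instead of writing y*z out as an intermediate tensor:

    import tensorflow as tf

    # Sketch: request XLA JIT compilation for this function (works on CPU or GPU).
    @tf.function(jit_compile=True)
    def fma(x, y, z):
        # XLA may fuse the multiply and the add into one kernel,
        # so y * z never has to be stored as a separate tensor.
        return x + y * z

    x = tf.random.normal([1024])
    y = tf.random.normal([1024])
    z = tf.random.normal([1024])
    print(fma(x, y, z)[:4])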
While GPUs speed up math calculations, there is a high fixed overhead for moving a kernel out to the GPU for execution.
I'm using cupy and numba. The first time I execute a function call that uses cupy's GPU version of numpy, it is quite slow. But the second time it is fast.
I've realized I don't understand how the kernel, or GPU code, gets out to the GPU to run. Operationally I want to understand this better, so that I can know when the things I do will accidentally create a slow step due to some kernel transfer. So I need some rules, or rules of thumb, to understand the concept.
For example, if I multiply two cupy arrays that are stashed on the GPU already, I might write C = A * B.
At some point the cupy overload of * multiplication has to be turned into code on the GPU, and it automagically also gets wrapped by the loops that break the work down into blocks and threads. So presumably this code is some kernel that gets transported out to the GPU. I'm guessing that the next time I call C*D the GPU no longer needs to be taught what * means, and so it will be fast.
But at some point I would imagine the GPU needs to clear out old code, so * or other operations not in use at that moment might get flushed from memory, and later on, when the call for A*B happens again, there will be a time penalty to recompile it out on the GPU.
Or so I imagine. If I'm right how do I know when these kernels are going to stick around or disappear?
If I'm wrong and this isn't how it works, or there is some other slow step (I'm assuming the data is already in arrays on the GPU), then what is this slow step, and how does one organize things so one pays it as little as possible?
I'm trying to avoid writing explicit numba thread-management kernels as one does in CUDA C++, and instead just use the standard numba @njit, @vectorize, and @stencil decorators. Likewise, in cupy I want to work at the level of the numpy syntax and not dive into thread management.
I've read a lot of docs on this, but they just refer to the overhead of kernels, not when it gets paid or how one controls that, so I'm confused.
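For reference, the slow-first-call effect I'm describing looks roughly like this (a quick, unscientific check, not a proper benchmark):

    import time
    import cupy

    a = cupy.arange(10**6, dtype=cupy.float32)
    b = cupy.arange(10**6, dtype=cupy.float32)

    for i in range(3):
        cupy.cuda.Device().synchronize()
        t0 = time.perf_counter()
        c = a * b                            # elementwise multiply on the GPU
        cupy.cuda.Device().synchronize()     # wait for the kernel to actually finish
        print("call", i, time.perf_counter() - t0)
    # call 0 is much slower than the rest: it includes compiling and caching
    # the '*' kernel, while the later calls only pay the launch cost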
I don't have a full answer to this yet. But so far the biggest clue I've gotten has come from reading up on the currently undocumented function cupy.fuse(), which makes it clearer than the numba.jit documentation where the kernel launch costs are paid. I have not found the connection to Contexts yet, as recommended by @talonmies.
See https://gist.github.com/unnonouno/877f314870d1e3a2f3f45d84de78d56c
The key example is this:
    import cupy

    c = cupy.arange(4)

    @cupy.fuse()
    def foo(x):
        return x + x + x + x + x
A call to foo() will be about three times slower with @cupy.fuse() commented out, because each "+" involves a kernel launch and a kernel free. Fusion merges all the adds into a single kernel, so the launch and free are paid once. For arrays with fewer than about a million elements on a typical 2018 GPU, the add itself is so fast that the launch and free dominate the time.
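Here is a rough timing sketch of that effect (illustrative only; assumes a reasonably recent CuPy and reuses the foo example above):

    import time
    import cupy

    def foo_plain(x):
        return x + x + x + x + x            # several elementwise kernels, several launches

    foo_fused = cupy.fuse()(foo_plain)      # same math compiled into one fused kernel

    x = cupy.arange(10**6, dtype=cupy.float32)

    for name, f in [("plain", foo_plain), ("fused", foo_fused)]:
        f(x)                                # warm-up: pay compilation/caching once
        cupy.cuda.Device().synchronize()
        t0 = time.perf_counter()
        for _ in range(1000):
            f(x)
        cupy.cuda.Device().synchronize()
        print(name, time.perf_counter() - t0)   # the gap is mostly launch overhead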
I wish I could find some documentation on @fuse. For example, does it unroll internal functions the way @jit does? Could I achieve that by stacking @jit and @fuse?
I'm still, however, largely in the dark about when these costs get paid in numba.
Recently I looked into reinforcement learning, and there was one question bugging me that I could not find an answer to: how is training effectively done using GPUs? To my understanding, constant interaction with an environment is required, which seems like a huge bottleneck, since this task is often non-mathematical / non-parallelizable. Yet AlphaGo, for example, uses multiple TPUs/GPUs. So how are they doing it?
Indeed, you will often have interactions with the environment in between learning steps, which will often be better off running on CPU than GPU. So, if your code for taking actions and your code for running an update / learning step are very fast (as in, for example, tabular RL algorithms), it won't be worth the effort of trying to get those on the GPU.
However, when you have a big neural network, that you need to go through whenever you select an action or run a learning step (as is the case in most of the Deep Reinforcement Learning approaches that are popular these days), the speedup of running these on GPU instead of CPU is often enough for it to be worth the effort of running them on GPU (even if it means you're quite regularly "switching" between CPU and GPU, and may need to copy some things from RAM to VRAM or the other way around).
When doing off-policy reinforcement learning (which means you can use transition samples generated by a "behavioral" policy different from the one you are currently learning), an experience replay buffer is generally used. Therefore, you can grab a bunch of transitions from this large buffer and use a GPU to optimize the learning objective with SGD (cf. DQN, DDPG).
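A minimal sketch of that pattern (names, sizes, and the fake environment are all illustrative; the actual GPU update step is only indicated by a comment):

    import random
    from collections import deque
    import numpy as np

    buffer = deque(maxlen=100_000)          # experience replay buffer, filled on the CPU

    def store(state, action, reward, next_state, done):
        buffer.append((state, action, reward, next_state, done))

    def sample_batch(batch_size=256):
        batch = random.sample(buffer, batch_size)
        states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
        return states, actions, rewards, next_states, dones

    state = np.zeros(4, dtype=np.float32)   # stand-in for a real environment state
    for step in range(1000):
        action = random.randrange(2)        # stand-in for the behavioral policy
        next_state = np.random.randn(4).astype(np.float32)
        reward, done = 1.0, False
        store(state, action, reward, next_state, done)
        state = next_state
        if len(buffer) >= 256 and step % 64 == 0:
            states, actions, rewards, next_states, dones = sample_batch()
            # ... copy the batch to the GPU and run one batched SGD step there
            #     (e.g. a DQN/DDPG update on the learning objective) ...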
One instance of a CPU-GPU hybrid approach for RL is https://github.com/NVlabs/GA3C.
Here, multiple CPUs are used to interact with different instances of the environment. "Trainer" and "Predictor" processes then collect the interactions using multi-process queues, and pass them to a GPU for back-propagation.
I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to do this to both be scalable to large class counts and efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabs when using a 2-level tree where each node has sqrt(N) classes. I'm also interested in a more general version for an arbitrary-depth tree with an arbitrary branching factor.
There are a few options that I see here:
1) Run tf.gather for every batch, where we gather the indices and splits (roughly sketched below). This creates problems with large batch sizes and fat trees, where the coefficients end up being duplicated a lot, leading to OOM errors.
2) Similar to #1, we could use tf.embedding_lookup, which would help with OOM errors but now keeps everything on the CPU and slows things down quite a bit.
3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup, due to the serialization.
Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?
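For concreteness, here is roughly what option 1 looks like for a 2-level tree (a sketch only: sizes are scaled down so it runs, and at 3K classes x 3K words the gathered [batch, M, d] slices are exactly where the duplication blows up):

    import tensorflow as tf

    d, K, M = 64, 100, 100            # hidden size, #classes, words per class (toy sizes)
    B = 32                            # batch size

    W_class = tf.Variable(tf.random.normal([d, K], stddev=0.01))
    W_word = tf.Variable(tf.random.normal([K, M, d], stddev=0.01))   # the big tensor

    def hsm_loss(h, target):          # h: [B, d], target: [B] word ids in [0, K*M)
        class_id = target // M
        word_id = target % M
        class_logits = tf.matmul(h, W_class)                          # [B, K]
        class_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=class_id, logits=class_logits)
        w = tf.gather(W_word, class_id)                               # [B, M, d], duplicated per sample
        word_logits = tf.einsum('bd,bmd->bm', h, w)                   # [B, M]
        word_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=word_id, logits=word_logits)
        return tf.reduce_mean(class_loss + word_loss)

    h = tf.random.normal([B, d])
    target = tf.random.uniform([B], maxval=K * M, dtype=tf.int32)
    print(hsm_loss(h, target))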
You mention that you want GPU-class performance:
but now keeps everything on the CPU and slows things down quite a bit
and wish to use 300-unit hidden size and 10M-word dictionaries.
This means that (assuming float32), you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradient for the output layer.
Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.
Realistically, you'll need a lot more GPU memory, because you'll also need to store:
other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead
Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.
However, you now have another problem:
To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.
What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.
The problem is that even if the target words in a batch belong to different GPUs, you'll still get the same performance as in this worst-case scenario if you want to do this computation in TensorFlow. This is because TensorFlow is a "specify-and-run" framework: the computational graph is the same for the best case and the worst case.
What is the best way to do this to both be scalable to large class counts and efficient?
The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.
Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.
If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:
400 == batch size << number of classes == 3K
In this case, you could simply use gather or embedding_lookup (although the copying is inefficient).
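For what it's worth, with a single (unsharded) parameter tensor the two calls are essentially interchangeable (a tiny sketch with hypothetical shapes):

    import tensorflow as tf

    params = tf.random.normal([100, 100, 64])   # e.g. [K, M, d] per-class weights
    ids = tf.constant([3, 17, 42])              # class ids for a batch of 3
    w1 = tf.gather(params, ids)                 # [3, 100, 64]
    w2 = tf.nn.embedding_lookup(params, ids)    # same result for a single parameter tensor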
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.
In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.
Just reading a bit about what the advantage of a GPU is, and I want to verify I understand it on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. On a CPU it would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster; a single CPU core is generally much faster than a single GPU core. However, GPUs typically have many more cores, and for running a single task on many pieces of data (so the task has to run once for each piece), the GPU will usually be faster. But these are data-driven situations, and each situation should be assessed on an individual basis to determine which to use and how to use it.
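If you want to see the "single task on many pieces of data" point in practice, a rough sketch comparing NumPy (CPU) with CuPy (GPU) looks like this (timings are illustrative and depend heavily on hardware and array size):

    import time
    import numpy as np
    import cupy as cp

    n = 10**7
    a_cpu = np.random.random(n).astype(np.float32)
    a_gpu = cp.asarray(a_cpu)                 # copy the data to the GPU once

    t0 = time.perf_counter()
    b_cpu = a_cpu * 2.0 + 1.0                 # CPU: one core (plus SIMD lanes) at a time
    cpu_t = time.perf_counter() - t0

    b_gpu = a_gpu * 2.0 + 1.0                 # warm-up: compile/cache the GPU kernels
    cp.cuda.Device().synchronize()
    t0 = time.perf_counter()
    b_gpu = a_gpu * 2.0 + 1.0                 # GPU: thousands of threads in parallel
    cp.cuda.Device().synchronize()
    gpu_t = time.perf_counter() - t0

    print(f"CPU {cpu_t:.4f}s   GPU {gpu_t:.4f}s")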
I need to accelerate many computations I am now doing with PyLab. I thought of using CUDA. The overall unit of computation (A) consists of doing several thousand entirely independent smaller computations (B). Each of those involves, at its initial stage, doing 40-41 independent, even smaller, computations (C). So parallel programming should really help. With PyLab, the overall computation (A) takes 20 minutes and each (B) takes a few tenths of a second.
As a beginner in this realm, my question is what level I should parallelize the computation at, whether at (C) or at (B).
I should clarify that the (C) stage consists of taking a bunch of data (thousands of floats) that is shared among all the (C) computations and performing various tasks on it, of which one of the most time-consuming is linear regression, which is itself parallelizable. The output of each (C) procedure is a single float. Each computation (B) basically consists of running procedure (C) many times and doing a linear regression on the data that comes out. Its output, again, is a single float.
I'm not familiar with CUDA programming so I am basically asking what would be the wisest strategy to start with.
An important consideration when deciding how (and if) to convert your project to CUDA is what type of memory access patterns your code requires. The GPU runs threads in groups of 32, called warps, and to get the best performance the threads in a warp should access memory in certain basic patterns, which are described in the CUDA Programming Guide (included with CUDA). In general, the more random the access patterns, the more likely the kernel is to become memory bound. In that case, the compute power of the GPU cannot be fully utilized.
The other main case where the compute power of the GPU cannot be fully utilized is when conditional logic and loops cause the threads in a warp to run through different code paths, as the GPU has to run all the threads in the warp through each code path.
If you find that these points may cause issues for your code, you should also do some research to see if there are known alternative ways to implement your code to run better on the GPU (this is often the case).
If you see your question about at which level to parallelize the computation in the light of the above considerations, it may become clear which choice to make.
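To make the warp-divergence point concrete, here is a small sketch using a numba.cuda kernel (the kernel and its names are mine, purely illustrative): threads in the same warp that disagree on a branch are serialized, so heavy per-thread branching eats into the parallel speedup.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def branchy(x, out):
        i = cuda.grid(1)                      # global thread index
        if i < x.shape[0]:
            # Threads within one warp that take different sides of this branch
            # are executed one group after the other (warp divergence).
            if x[i] > 0.5:
                out[i] = x[i] * 2.0
            else:
                out[i] = x[i] * 0.5

    x = np.random.random(1_000_000).astype(np.float32)
    out = np.zeros_like(x)
    threads = 256
    blocks = (x.size + threads - 1) // threads
    branchy[blocks, threads](x, out)          # numba copies the arrays to/from the GPU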