CUDA-like optimization on Tensorflow-GPU - tensorflow

I am trying to implement a neural network architecture (Self Organizing Maps) for execution on GPUs. I am exploring TensorFlow for this task.
In TensorFlow, I noticed that you just have to specify the GPU as the device on which to execute something, as in this post. It seems that the way operations are parallelized is decided by TF, and the user has no way to make optimization decisions. The "Optimizing for GPU" section of the TensorFlow Performance Guide also does not talk about explicit control over parallelizing operations.
My question is, can I do CUDA-like optimization in TensorFlow? More elaborately, is it possible to define which operation will be parallelized (like defining CUDA kernels for parallel operations)?

Yes, but you probably don't want to.
At the most extreme you can define your own op (as described here: https://www.tensorflow.org/extend/adding_an_op).
You can implement it as a GPU Kernel and write whatever you want.
You probably don't want to. The default operations are likely well optimized; I doubt you would be able to squeeze anything significant out of them.
You can decide the device placement for each individual operation (by using tf.device), but you will incur data transfer overhead every time you switch. This should cover the cases where there is some operation that is slow to execute on the GPU.
If you want to process part of the data on CPU and part on the GPU you can slice your data and do 2 operations (one on CPU and one on GPU).
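As an illustration of those last two points, here is a minimal sketch (the shapes and the 200/800 split are arbitrary, and a GPU is assumed to be available) of pinning ops to devices with tf.device and of slicing one batch so part of the work runs on the CPU and part on the GPU:

```python
import tensorflow as tf

tf.compat.v1.disable_eager_execution()  # graph mode, as in classic TF

data = tf.random.uniform([1000, 256])
cpu_part, gpu_part = tf.split(data, [200, 800], axis=0)

with tf.device('/cpu:0'):                 # explicitly pin this op to the CPU
    cpu_out = tf.reduce_sum(tf.square(cpu_part), axis=1)

with tf.device('/gpu:0'):                 # and this one to the GPU
    gpu_out = tf.reduce_sum(tf.square(gpu_part), axis=1)

# Concatenating forces a device-to-device copy: this is the transfer
# overhead mentioned above.
result = tf.concat([cpu_out, gpu_out], axis=0)

with tf.compat.v1.Session() as sess:
    print(sess.run(result).shape)  # (1000,)
```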

By default in TF, in graph mode (not in eager mode), all TF ops run in parallel. There is a thread pool for that, and its size is controlled via inter_op_parallelism_threads. (See also.)
That does not necessarily mean that e.g. multiple matmuls will really run in parallel if they are internally synchronized. That is the case for most CUDA ops, as there is only a single CUDA stream. See here.
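For reference, a minimal sketch of setting that thread pool size (TF 1.x-style ConfigProto; in TF 2.x the equivalent knob is tf.config.threading.set_inter_op_parallelism_threads):

```python
import tensorflow as tf

config = tf.compat.v1.ConfigProto(
    inter_op_parallelism_threads=4,  # how many independent ops may run concurrently
    intra_op_parallelism_threads=8,  # threads used inside a single op (e.g. a matmul)
)
sess = tf.compat.v1.Session(config=config)
```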

Related

What is a fused kernel (or fused layer) in deep learning?

I am reading the Apex AMP documentation:
A Python-only build omits:
Fused kernels required to use apex.optimizers.FusedAdam.
Fused kernels required to use apex.normalization.FusedLayerNorm.
Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp.
DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
There also seems to be a "FusedAdam" optimizer:
The Adam optimizer in Pytorch (like all Pytorch optimizers) carries out optimizer.step() by looping over parameters, and launching a series of kernels for each parameter. This can require hundreds of small launches that are mostly bound by CPU-side Python looping and kernel launch overhead, resulting in poor device utilization. Currently, the FusedAdam implementation in Apex flattens the parameters for the optimization step, then carries out the optimization step itself via a fused kernel that combines all the Adam operations. In this way, the loop over parameters as well as the internal series of Adam operations for each parameter are fused such that optimizer.step() requires only a few kernel launches.
The current implementation (in Apex master) is brittle and only works with Amp opt_level O2. I’ve got a WIP branch to make it work for any opt_level (https://github.com/NVIDIA/apex/pull/351). I recommend waiting until this is merged then trying it.
This partially explains it. I'm left with more questions:
What is meant by kernel? A layer or an optimizer?
Is the idea of fused layer the same as a fused optimizer?
"Kernel" here is for computation kernels: https://en.wikipedia.org/wiki/Compute_kernel
Operations like convolution are often implemented using compute kernels for better efficiency. Compute kernels can be written using C, CUDA, OpenCL or even assembly for maximum efficiency. It is therefore not surprizing that "a Python-only build" does not support...
"Fusing" means commonalization of computation steps. Basically, it's an implementation trick to run code more efficiently by combining similar operations in a single hardware (GPU, CPU or TPU) operation. Therefore, a "fusedLayer" is a layer where operations benefit from a "fused" implementation.

Does Tensorflow automatically use multiple CPUs?

I have programmed some code doing an inference with Tensorflow's C API (CPU only). It is running on a cluster node, where I have access to 24 CPUs and 1 GPU. I do not make use of the GPU as I will need to do the task CPU-only later on.
Somehow, every time I call the TensorFlow code from the other program (OpenFOAM), TensorFlow seems to run parallelized across all CPUs, even though I have not done anything to cause this behavior. Now I would like to know whether TensorFlow does this parallelization by default?
Greets and thanks in advance!
I am not sure how you are using TensorFlow, but a typical TensorFlow training job has an input pipeline which can be thought of as an ETL process. These are the main activities involved:
Extract: Read data from persistent storage
Transform: Use CPU cores to parse and perform preprocessing operations on the data such as image decompression, data augmentation transformations (such as random crop, flips, and color distortions), shuffling, and batching.
Load: Load the transformed data onto the accelerator device(s) (for example, GPU(s) or TPU(s)) that execute the machine learning model.
CPUs are generally used during the data transformation. During the transformation, the data input elements are preprocessed. To improve the performance of the pre-processing, it is parallelized across multiple CPU cores by default.
Tensorflow provides the tf.data API which offers the tf.data.Dataset.map transformation. To control the parallelism, the map provides the num_parallel_calls argument.
Read more on this from here:
https://www.tensorflow.org/guide/performance/datasets
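A minimal sketch of that parallelized "transform" stage; tf.data.AUTOTUNE is the TF 2.x spelling (older releases use tf.data.experimental.AUTOTUNE), and the preprocess function here is just a stand-in for real decoding/augmentation work:

```python
import tensorflow as tf

def preprocess(x):
    # stand-in for image decoding, augmentation, etc.
    return tf.cast(x, tf.float32) / 255.0

dataset = (
    tf.data.Dataset.range(10_000)
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU transform
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with model execution
)
```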

By default, does TensorFlow use GPU/CPU simultaneously for computing or GPU only?

By default, TensorFlow will use our available GPU devices. That said, does TensorFlow use the GPUs and CPUs simultaneously for computing, or the GPUs for computing and the CPUs for job handling (either way, the CPUs are always active, I think)?
Generally it uses both the CPU and the GPU (assuming you are using a GPU-enabled TensorFlow build). What actually gets used depends on the operations your code runs.
For each operation available in TensorFlow, there are several "implementations" of that operation, generally a CPU implementation and a GPU one. Some operations only have CPU implementations because a GPU implementation would make no sense, but overall most operations are available for both devices.
If you write custom operations, then you need to provide whichever implementations you want.
TensorFlow operations come packaged with a list of devices they can execute on and a list of associated priorities.
For example, a convolution is very well suited to computation on a GPU but can still be done on a CPU, whereas scalar additions should definitely be done on a CPU. You can override this selection using tf.device with the name of the device of interest.
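A small sketch of inspecting and overriding that placement (TF 2.x API, assuming a GPU build):

```python
import tensorflow as tf

tf.debugging.set_log_device_placement(True)  # print the device chosen for each op

a = tf.random.normal([1024, 1024])
b = tf.random.normal([1024, 1024])

c = tf.matmul(a, b)        # placed on the GPU if one is available

with tf.device('/CPU:0'):  # force this particular matmul onto the CPU
    d = tf.matmul(a, b)
```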
Someone correct me if I'm wrong.
But from what I'm aware, TensorFlow only uses either the GPU or the CPU, depending on which installation you ran. For example, if you used pip install tensorflow for Python 2 or python3 -m pip install tensorflow for Python 3, you'll only get the CPU version.
Vice versa for the GPU version.
If you still have any questions or if this did not correctly answer your question feel free to ask me more.

Debugging batching in Tensorflow Serving (no effect observed)

I have a small web server that gets input in terms of sentences and needs to return a model prediction using Tensorflow Serving. It's working all fine and well using our single GPU, but now I'd like to enable batching such that Tensorflow Serving waits a bit to group incoming sentences before processing them together in one batch on the GPU.
I'm using the predesigned server framework with the predesigned batching framework using the initial release of Tensorflow Serving. I'm enabling batching using the --batching flag and have set batch_timeout_micros = 10000 and max_batch_size = 1000. The logging does confirm that batching is enabled and that the GPU is being used.
However, when sending requests to the server, the batching has minimal effect. Sending 50 requests at the same time scales almost linearly in time compared with sending 5 requests. Interestingly, the predict() function of the server is run once for each request (see here), which suggests to me that the batching is not being handled properly.
Am I missing something? How do I check what's wrong with the batching?
Note that this is different from How to do batching in Tensorflow Serving? as that question only examines how to send multiple requests from a single client, but not how to enable Tensorflow Serving's behind-the-scenes batching for multiple separate requests.
(I am not familiar with the server framework, but I'm quite familiar with HPC and with cuBLAS and cuDNN, the libraries TF uses to do its dot products and convolutions on GPU)
There are several issues that could cause disappointing performance scaling with the batch size.
I/O overhead, by which I mean network transfers, disk access (for large data), serialization, deserialization and similar cruft. These things tend to be linear in the size of the data.
To look into this overhead, I suggest you deploy 2 models: the one you actually need, and one that's trivial but uses the same I/O, then subtract the time needed by one from the other.
This time difference should be similar to the time the complex model takes when you run it directly, without the I/O overhead.
If the bottleneck is in the I/O, speeding up the GPU work is inconsequential.
Note that even if increasing the batch size makes the GPU faster, it might make the whole thing slower, because the GPU now has to wait for the I/O of the whole batch to finish to even start working.
cuDNN scaling: things like matmul need large batch sizes to achieve their optimal throughput, but convolutions using cuDNN might not (at least that hasn't been my experience, though this might depend on the cuDNN version and the GPU architecture).
RAM, GPU RAM, or PCIe bandwidth-limited models: If your model's bottleneck is in any of these, it probably won't benefit from bigger batch sizes.
The way to check this is to run your model directly (perhaps with mock input), compare the timing to the aforementioned time difference and plot it as a function of the batch size.
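A rough sketch of that check, assuming model is any Python callable that runs your network on a batch (e.g. a loaded Keras model or a thin wrapper around a session run); the shapes and vocabulary size below are made up:

```python
import time
import numpy as np

def time_batch(model, batch_size, seq_len=50, repeats=20):
    x = np.random.randint(0, 10_000, size=(batch_size, seq_len))  # mock sentences
    model(x)                                   # warm-up run (graph build, caches)
    start = time.perf_counter()
    for _ in range(repeats):
        model(x)
    return (time.perf_counter() - start) / repeats

model = lambda x: np.tanh(x @ np.random.randn(x.shape[1], 128))  # replace with your real model

for bs in (1, 5, 10, 50, 100):
    # A near-flat curve means the device has spare capacity and batching should help;
    # near-linear growth means the model itself is the bottleneck.
    print(bs, time_batch(model, bs))
```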
By the way, as per the performance guide, one thing you could try is using the NCHW layout, if you are not already. There are other tips there.

What are the possible reasons that a deep learning model runs slower on GPU than running on CPU?

My GPU, which is a Titan X, should be faster than the CPU, which is an Intel(R) Xeon(R) CPU E5-2643 v3 @ 3.40GHz. But two of my models run a little slower on the GPU, while one model runs much faster on the GPU. Of those two models, one is implemented with TensorFlow and the other with Theano. What the two models have in common is that they are both hierarchical Bi-LSTM models, which means the last outputs of the bottom Bi-LSTM are fed into the other as inputs. So neither of the models is too simple.
So I would like to ask: what are the possible reasons that they run slower on the GPU than on the CPU?
I could provide some info for the theano side:
Theano has been having multiple issues with scan, which is its workhorse for RNN loops.
Here are some of them:
Since Theano does not know shape information at compile time, the resulting compiled routine can be suboptimal (like using gemv for a vector-vector dot).
(as of Nov 2016) The current version of scan is implemented in Cython, which has some overhead over a pure C++ version. If the RNN doesn't have much computational density in a single step, this can be significant.
It does not pipeline well. Using scan to implement a map operation can often be slower than using the underlying operation directly. Apparently the optimizer is still immature and can't recognize this kind of problem.
Solutions:
Try upgrading to the dev version. They have been making various improvements over time.
Try unrolling the RNN (using a plain Python loop to build the graph instead of scan), if you can afford the compilation time; see the sketch after this list.
I made a PR to address the gemv issue, but only for the old GPU backend. Give it a try (if not merged yet). It is now part of the dev master branch.
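For the second point, a rough sketch of what "unrolling" means here: build the graph with a plain Python loop over a fixed number of steps instead of theano.scan (the shapes and the simple tanh RNN are made up for illustration):

```python
import numpy as np
import theano
import theano.tensor as T

n_steps, n_in, n_hid = 20, 32, 64
X = T.tensor3('X')  # (n_steps, batch, n_in)
W_in = theano.shared(np.random.randn(n_in, n_hid).astype(theano.config.floatX))
W_rec = theano.shared(np.random.randn(n_hid, n_hid).astype(theano.config.floatX))

h = T.zeros((X.shape[1], n_hid))
for t in range(n_steps):  # Python loop => the graph is unrolled, no scan node
    h = T.tanh(T.dot(X[t], W_in) + T.dot(h, W_rec))

step_fn = theano.function([X], h)
out = step_fn(np.random.randn(n_steps, 8, n_in).astype(theano.config.floatX))
print(out.shape)  # (8, 64)
```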