Tensorflow: use of caching_device

I am currently working with TensorFlow in a multi-GPU setting, training a model on multiple GPUs using multiple towers as done in the multi-GPU part of https://www.tensorflow.org/tutorials/deep_cnn.
All the weights of the model are shared across the towers and, for this purpose, placed in CPU memory. To do so, explicit device placements are used everywhere in the code where a variable is created or reused (with tf.get_variable).
I was looking for a more convenient way to place all variables on the CPU and came across the caching_device argument of variable_scope. I was wondering if it is what I am looking for, but I am still not sure, as the corresponding weights in the graph have ops placed on the GPU (matching the device placement used at creation) plus a read operation on the CPU.
Do you have information on the concrete use of caching_device: what is actually happening, and where is the variable really placed?
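For context, here is roughly what the two placements look like (a sketch, assuming the TF 1.x graph-mode API; shapes and names are just illustrative):

```python
import tensorflow as tf  # assuming TF 1.x graph mode

# What the code does today: pin every variable to the CPU explicitly.
with tf.device('/cpu:0'):
    w = tf.get_variable('w', shape=[1024, 1024])

# What I experimented with: create under the tower's GPU device,
# but pass caching_device to the variable scope.
with tf.device('/gpu:0'):
    with tf.variable_scope('tower_0', caching_device='/cpu:0'):
        v = tf.get_variable('v', shape=[1024, 1024])
# In the resulting graph, v's ops sit on /gpu:0 (the creation device)
# while a read operation shows up on /cpu:0, as described above.
```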
Thanks.

Related

Using `DataParallel` when network needs a shared (constant) `Tensor`

I would like to use DataParallel to distribute my computations across multiple GPUs along the batch dimension. My network requires a Tensor (let's call it A) internally, which is constant and doesn't change during the optimization. It seems that DataParallel does not automatically copy this Tensor to all the GPUs in question, and the network will thus complain that the chunk of the input data x that it sees resides on a different GPU than A.
Is there a way DataParallel can handle this situation automatically? Alternatively, is there a way to copy a Tensor to all GPUs? Or should I just keep one Tensor for each GPU and manually figure out which copy to use depending on where the chunk seen by forward resides?
You should wrap your tensor in torch.nn.Parameter and set requires_grad=False during its creation.
torch.nn.Parameter does not mean the tensor has to be trainable.
It merely means it is part of the model and should be transferred if needed (e.g. to multiple GPUs).
If that weren't the case, there would be no way for torch to know which tensors inside __init__ are part of the model (you could do some operations on tensors and add them to self just to get something done).
I don't see a need for another function to do just that, though the name might be a little confusing.
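A minimal sketch of that pattern (the module, shapes and names here are just illustrative):

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, A):
        super().__init__()
        # A non-trainable Parameter is still part of the module's state,
        # so DataParallel replicates it to every GPU along with the weights.
        self.A = nn.Parameter(A, requires_grad=False)
        self.fc = nn.Linear(A.shape[-1], 10)

    def forward(self, x):
        # Inside each replica, x and self.A live on the same device.
        return self.fc(x @ self.A)

net = nn.DataParallel(Net(torch.randn(16, 16)).cuda())
out = net(torch.randn(8, 16).cuda())  # the batch is split across the visible GPUs
```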

Training on multi-GPUs with a small batch size

I am running TensorFlow on a machine which has two GPUs, each with 3 GB of memory. My batch is only 2 GB and so fits on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume that you mean that one data batch is 2GB. However, Tensorflow also requires memory to store variables as well as hidden layer results etc. (to compute gradients). For this reason it also depends on your specific model whether or not the memory will be enough. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: Tensorflow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops will be placed on the first GPU and the rest on the CPU. This is despite Tensorflow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the Tensorflow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables.
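A minimal sketch of that idea (assuming the TF 1.x graph-mode API; the model and shapes are placeholders):

```python
import tensorflow as tf  # assuming TF 1.x

NUM_GPUS = 2
x = tf.placeholder(tf.float32, [None, 784])
chunks = tf.split(x, NUM_GPUS)            # one chunk of the batch per GPU

tower_logits = []
for i, chunk in enumerate(chunks):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        # The same variables are reused by every tower; only the data differs.
        tower_logits.append(tf.layers.dense(chunk, 10))

logits = tf.concat(tower_logits, axis=0)  # reassemble the full batch
```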
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution here using just two simple wrappers, but I personally haven't been able to use it successfully (pretty slow and crashes randomly with a segfault).

What is the difference of static Computational Graphs in tensorflow and dynamic Computational Graphs in Pytorch?

When I was learning TensorFlow, one basic concept was the computational graph, and the graph was said to be static.
Then I found that in PyTorch the graph is said to be dynamic.
What is the difference between static computational graphs in TensorFlow and dynamic computational graphs in PyTorch?
Both frameworks operate on tensors and view any model as a directed acyclic graph (DAG), but they differ drastically in how you can define them.
TensorFlow follows the ‘data as code and code is data’ idiom. In TensorFlow you define the graph statically before a model can run. All communication with the outer world is performed via the tf.Session object and tf.placeholder, tensors that will be substituted by external data at runtime.
In PyTorch things are way more imperative and dynamic: you can define, change and execute nodes as you go, with no special session interfaces or placeholders. Overall, the framework is more tightly integrated with the Python language and feels more native most of the time. When you write in TensorFlow you sometimes feel that your model is behind a brick wall with several tiny holes to communicate through. Anyway, this is still more or less a matter of taste.
However, these approaches differ not only from a software engineering perspective: there are several dynamic neural network architectures that can benefit from the dynamic approach. Recall RNNs: with static graphs, the input sequence length has to stay constant. This means that if you develop a sentiment analysis model for English sentences you must fix the sentence length to some maximum value and pad all shorter sequences with zeros. Not too convenient, huh. And you will run into more problems in the domain of recursive RNNs and tree-RNNs. Currently TensorFlow has limited support for dynamic inputs via TensorFlow Fold; PyTorch has it by default.
Reference:
https://medium.com/towards-data-science/pytorch-vs-tensorflow-spotting-the-difference-25c75777377b
https://www.reddit.com/r/MachineLearning/comments/5w3q74/d_so_pytorch_vs_tensorflow_whats_the_verdict_on/
Both TensorFlow and PyTorch allow specifying new computations at any point in time. However, TensorFlow has a "compilation" step which incurs a performance penalty every time you modify the graph. So TensorFlow's optimal performance is achieved when you specify the computation once and then flow new data through the same sequence of computations.
It's similar to interpreters vs. compilers -- the compilation step makes things faster, but also discourages people from modifying the program too often.
To make things concrete, when you modify the graph in TensorFlow (by appending new computations using the regular API, or removing some computation using tf.contrib.graph_editor), this line is triggered in session.py. It will serialize the graph, and then the underlying runtime will rerun some optimizations, which can take extra time, perhaps 200 microseconds. In contrast, running an op in a previously defined graph, or in numpy/PyTorch, can take as little as 1 microsecond.
In tensorflow you first have to define the graph, then you execute it.
Once defined, your graph is immutable: you can't add or remove nodes at runtime.
In PyTorch, instead, you can change the structure of the graph at runtime: you can add and remove nodes on the fly, dynamically changing its structure.
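A tiny side-by-side illustration (assuming TF 1.x graph mode):

```python
# TensorFlow 1.x: build the graph first, then run it in a session.
import tensorflow as tf
x = tf.placeholder(tf.float32, shape=[None])
y = x * 2.0                                        # adds a node; nothing runs yet
with tf.Session() as sess:
    print(sess.run(y, feed_dict={x: [1.0, 2.0]}))  # [2. 4.]

# PyTorch: the graph is built on the fly as each operation executes.
import torch
x = torch.tensor([1.0, 2.0])
y = x * 2.0                                        # runs immediately
print(y)                                           # tensor([2., 4.])
```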

Is it really necessary to define devices when using data parallelism with Tensorflow?

I learned from its original white paper that TensorFlow applies a greedy method for deciding how to distribute operations among existing devices. However, the only multi-GPU example available, Using multiple GPUs, suggests that operations be gathered into "towers" and explicitly assigned to specific devices. Is it possible to simply define multiple copies of those towers and have TensorFlow automagically assign them to different GPUs?

Tensorflow: How to write clean code for multi GPU model parallelism?

Currently I'm implementing a large custom model, referencing the multi-GPU CIFAR-10 example that comes with TensorFlow. However, the code I ended up writing based on it was not clean and was error prone. For example, I had to find every trainable variable and wrap it in "with tf.device('/cpu:0')".
Are there more efficient/cleaner ways of adapting a model for multi-GPU execution?
Many thanks for any support.
Here's an example from Rafal:
You make a loop over towers, with the body constructing the i-th tower inside with tf.device(assign_to_gpu(i)). The function assign_to_gpu treats variables differently and assigns them onto the "ps" device.
Note: we found that when GPUs are p2p connected, training was faster when variables were kept on gpu:0 rather than cpu:0.
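A sketch of what such a device function could look like (the helper itself isn't shown here, so this is an assumption based on the description; assuming TF 1.x):

```python
import tensorflow as tf  # assuming TF 1.x graph mode

def assign_to_gpu(gpu=0, ps_device='/cpu:0'):
    """Device function: variables go to ps_device, everything else to the given GPU."""
    def _assign(op):
        node_def = op if isinstance(op, tf.NodeDef) else op.node_def
        if node_def.op in ('Variable', 'VariableV2', 'VarHandleOp'):
            return ps_device          # per the note above, '/gpu:0' may be faster with p2p GPUs
        return '/gpu:%d' % gpu
    return _assign

# Building the towers: variables land on ps_device, compute on /gpu:i.
num_gpus = 2
x = tf.placeholder(tf.float32, [None, 784])
for i, chunk in enumerate(tf.split(x, num_gpus)):
    with tf.device(assign_to_gpu(i)), tf.variable_scope('model', reuse=(i > 0)):
        logits = tf.layers.dense(chunk, 10)
```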