Using `DataParallel` when network needs a shared (constant) `Tensor`

I would like to use DataParallel to distribute my computations across multiple GPUs along the batch dimension. My network requires a Tensor (let's call it A) internally, which is constant and doesn't change during the optimization. It seems that DataParallel does not automatically copy this Tensor to all the GPUs in question, so the network complains that the chunk of the input data x that it sees resides on a different GPU than A.
Is there a way DataParallel can handle this situation automatically? Alternatively, is there a way to copy a Tensor to all GPUs? Or should I just keep one Tensor for each GPU and manually figure out which copy to use depending on where the chunk seen by forward resides?

You should wrap your tensor in torch.nn.Parameter and set requires_grad=False during its creation.
torch.nn.Parameter does not mean the tensor has to be trainable.
It merely means it is part of the model and should be transferred if needed (e.g. to multiple GPUs).
If that weren't the case, there would be no way for torch to know which tensors inside __init__ are part of the model (you could do some operations on tensors and assign them to self just to get something done).
I don't see a need for another function to do just that, though the name might be a little confusing.
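A minimal sketch of this suggestion (the module, shapes, and names below are made up for illustration): the constant tensor is registered as an nn.Parameter with requires_grad=False, so DataParallel replicates it to every GPU along with the rest of the module:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, A):
        super(Net, self).__init__()
        # Constant tensor: part of the model, replicated by DataParallel, never trained.
        self.A = nn.Parameter(A, requires_grad=False)
        self.linear = nn.Linear(A.shape[1], 10)

    def forward(self, x):
        # x arrives already scattered to the right GPU; self.A lives on the same GPU.
        return self.linear(x @ self.A)

A = torch.randn(64, 64)
net = nn.DataParallel(Net(A)).cuda()
# out = net(torch.randn(128, 64).cuda())  # the batch is split across the available GPUs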

Related

TensorFlow Federated - How to work with SparseTensors

I am using TensorFlow Federated to simulate a scenario in which clients hosted on a remote server can work with our very sparse dataset in a federated setting.
Presently, the code is capable of running with a small subset of the very sparse dataset being loaded on the server-side and passing it to the remote workers hosted on another device. The data is in SVM Light format and can be loaded through sklearn's load_svmlight_file function, but needs to be converted into Tensors to work within tff. The current solution to do so involves converting the very sparse data into a dense array, then setting it up through the tf.data.Dataset.from_tensor_slices function for use with a keras model (following existing examples for tff).
This works, but takes up significant memory resources and is not suitable for the dataset as it cannot be run remotely for more than six samples due to the sparse data's serialized size, nor locally with more than a few hundred samples due to the size in memory.
To mitigate this, I converted the data into SparseTensors, but this approach fails because the tff.learning.from_keras_model function expects the input_spec to be a pair of TensorSpec values, not a SparseTensorSpec for the features with a TensorSpec for the labels.
So, are there any concrete examples or known methods to work with SparseTensors within Keras models in tff? Or must they be dense Tensors for now? The data loads fine when it is not converted to dense Tensors, so I need to find a solution for working with the sparse data directly.
If there is presently no way to do so, are there examples of strategies within tff to work with very small subsets of data at a time, either being loaded directly with the remote client or being passed from the server?
Thanks!
I'd say the best approach now is to work with TensorFlow's representation of a tf.SparseTensor, that is, a tuple of three tensors: indices, values, and dense_shape.
So if the problem is that Keras requires the input not to be a sparse tensor, you can pass the input as, for instance, a dictionary consisting of these three tensors, which you convert back to a tf.sparse.SparseTensor as part of your tf.data pipeline.
See also this tutorial, which I think does something related to what you are looking for, and please ask more detailed questions if needed!
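A hedged sketch of that idea (the feature dictionary, shapes, and toy values here are invented, not from the original post): each example carries the three component tensors, and the SparseTensor is rebuilt inside the tf.data pipeline before the model sees it:

import numpy as np
import tensorflow as tf

# One toy example with 2 non-zero entries in a 1 x 10 row.
features = {
    'indices': np.array([[0, 1], [0, 7]], dtype=np.int64),
    'values': np.array([3.0, 5.0], dtype=np.float32),
    'dense_shape': np.array([1, 10], dtype=np.int64),
}
labels = np.array([1], dtype=np.int64)

dataset = tf.data.Dataset.from_tensors((features, labels))

def rebuild_sparse(feats, label):
    # Reassemble the tf.sparse.SparseTensor from its three components.
    sparse = tf.sparse.SparseTensor(indices=feats['indices'],
                                    values=feats['values'],
                                    dense_shape=feats['dense_shape'])
    return tf.sparse.reorder(sparse), label

dataset = dataset.map(rebuild_sparse)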

Tensorflow : use of Caching device

I am currently working with TensorFlow in a multi-GPU setting, training a model on multiple GPUs using multiple towers, as done in the multi-GPU part of https://www.tensorflow.org/tutorials/deep_cnn.
All the weights of the model are shared across all towers and, for this purpose, placed in CPU memory. To do so, specific device placements are used everywhere in the code where a variable is created/reused (with tf.get_variable), as sketched below.
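As an illustration only (names, shapes, and the number of GPUs are made up), a rough sketch of that idiom: each tower builds its ops on its own GPU while variable creation is pinned to the CPU, so all towers share a single copy of the weights:

import tensorflow as tf  # TF 1.x graph-mode API, as in the question

num_gpus = 2
towers = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('model', reuse=(i > 0)):
            with tf.device('/cpu:0'):
                # Explicit placement: the shared weight lives in CPU memory.
                w = tf.get_variable('w', shape=[1024, 10])
            x = tf.placeholder(tf.float32, [None, 1024], name='x_%d' % i)
            towers.append(tf.matmul(x, w))  # the matmul itself runs on the GPU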
I was looking for a more convenient way to place all variables on the CPU and stumbled upon the caching_device argument of variable_scope. I was wondering whether it is what I am looking for, but I am still not sure, since the corresponding weights in the graph have ops placed on the GPU (matching the device placement used at creation), plus a read operation on the CPU.
Do you have any information on the concrete use of caching_device: what actually happens, and where is the variable really placed?
Thanks.

How to feed a Variable to a placeholder by feed_dict in Tensorflow

One way to solve my problem is to first get the value of my variable (a GPU-to-CPU copy) and then feed that value using feed_dict (a CPU-to-GPU copy). But this involves copying the data from the GPU to the CPU and back again, which is slower than just computing on the CPU. I would like to do it without any data copying.
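A minimal sketch of this first workaround, with made-up variable and placeholder names (TF 1.x graph mode, as in the question):

import tensorflow as tf

v = tf.Variable(tf.random_normal([4, 4]))          # lives on the GPU if one is available
x = tf.placeholder(tf.float32, shape=[4, 4])
y = tf.matmul(x, x)                                # the computation to run on v's value

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    v_value = sess.run(v)                          # GPU -> CPU copy
    result = sess.run(y, feed_dict={x: v_value})   # CPU -> GPU copy, then compute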
Another way to solve my problem is to directly build a graph on the variables I am using. But I have hundreds of variables, so I would need to build a graph for each one. All the graphs execute the same computation, just on different variables, and this is even slower because I need to build hundreds of graphs.
Is there a way to solve this problem gracefully in TensorFlow? In Theano, this problem is naturally solved because Theano accepts both CPU tensors and GPU tensors as function arguments.
Thanks!

Why does TensorFlow have a lot of mathematical equations re-implemented?

I was looking through the API in TensorFlow and noticed that a lot of mathematical operations that already exist in Python and numpy have been re-implemented (or at least given a TensorFlow interface), for example tf.add and tf.matmul.
Is there a good reason to do this?
I've been searching through their pages but can't find why they would do this.
I do have some guesses, though. One of my main guesses is that they probably want those operations to have some backpropagation effect on whatever neural network graph gets implemented; in other words, to have their derivatives implemented. Is this one of the reasons? (I wish I knew how to even check whether my guess is right.)
For example, in one of the most basic examples of linear regression, one defines the prediction function that one wants to implement:
product = tf.matmul(x,W)
y = product + b
instead of
product = tf.matmul(x,W)
y = tf.add(product, b)
Somehow the first implementation does not interfere with the Stochastic Gradient Descent algorithm for training, so it probably doesn't matter whether one uses numpy or tf.add to train? This is one aspect that confuses me: when do I know which one I should be using?
Or maybe there are performance reasons? Or maybe it's to give those operations access to the GPU when GPUs are required?
You have to understand that these operations build a TensorFlow graph, meaning they aren't the same as the numpy functions; they are more of an abstraction of them.
Maybe you have noticed that you have to create a session and then evaluate the functions through that session to get a result, whereas numpy functions are executed directly. This is because the graph and its functions define what to do, like writing down a formula; to get results for a specific x (or whatever), you have to insert a value for x. That is what you are doing through the session and eval.
So, to conclude: with TensorFlow you define a graph, which is a more abstract representation of the functions, and the graph isn't executed when it is defined; it is executed when you call eval and thereby run the session.
Also note that you can't mix numpy functions and TensorFlow functions directly, but you can define your own TensorFlow ops (https://www.tensorflow.org/versions/r0.9/how_tos/adding_an_op/index.html).
Btw I guess most of the tensorflow functions are using numpy under the hood. :)
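A small illustration of that difference (TF 1.x graph mode, matching the era of the question):

import numpy as np
import tensorflow as tf

a_np = np.add(2.0, 3.0)        # numpy: executed immediately, a_np is already 5.0

a_tf = tf.add(2.0, 3.0)        # TensorFlow: only adds a node to the graph
with tf.Session() as sess:
    result = sess.run(a_tf)    # the addition actually runs here and returns 5.0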

Adjust existing Tensorflow Graphs (VGG)

I want to use the VGG model converted to TensorFlow by Ryan.
https://github.com/ry/tensorflow-vgg16
Now I want to adjust the layers: add another layer or change the fully connected layers. But I don't know how to get the individual layers/weights out of the GraphDef or how to adjust the graph.
Short answer: you can't adjust a graph, but there are probably ways to get what you want accomplished.
Long answer: TensorFlow Graph objects are structurally immutable. You can modify some aspects of them (e.g., the shape of a tensor flowing into a node), but you can't remove a node or add a node between two existing nodes. However, there are a couple ways to get the same effect:
1. If your changes are limited to additions only, then there's no problem with doing this. For instance, if you wanted to add a layer on the end of a network, go for it. Likewise, you can "replace" the last layer by simply adding a new layer which takes the second-to-last layer as input and just ignoring the existing last layer. When you run the graph, if you never ask for the output of the original last layer, TensorFlow will never compute it.
2. If you need to do modifications, one way is to slowly build up a copy of the graph node by node. So read in the original graph definition, then build your own new graph by iterating over the original and adding similar nodes to your new copy. This is somewhat tedious and can be error-prone. Moreover...
3. ...You might not need to "adjust" the graph at all. If you want something similar to that VGG-16 implementation, you can just work off the python code directly. Don't like the width of fc6? Just edit the code that generates it.
This brings us to the real issue, though. If your goal is to modify the network and be able to re-use the weights, then 2. and 3. aren't going to work. Realistically, this isn't possible in a lot of cases. For instance, if I wanted to add or remove a layer in the middle of VGG-16 (say, adding another convolutional layer), the pre-trained weights are no longer valid. You might be able to salvage any pre-trained weights which are upstream of your changes, but everything downstream will basically be wrong. You'll need to retrain the network anyways. (Maybe you can use the pre-trained networks as initialization, but you'll still need to retrain.) Even if you're just adding to the network (as in 1.), you'll still need to train the network.
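As a rough sketch of option 1 (TF 1.x graph mode; the file name, tensor name, and shapes below are illustrative guesses, not taken from the repository): import the GraphDef, grab an intermediate tensor, and build a new head on top of it while simply ignoring the original output:

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('vgg16.tfmodel', 'rb') as f:    # path is illustrative
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='import')
    # Hypothetical name of the second-to-last activation in the imported graph.
    fc7 = graph.get_tensor_by_name('import/Relu_1:0')
    with tf.variable_scope('new_head'):
        w = tf.get_variable('w', shape=[4096, 20])
        b = tf.get_variable('b', shape=[20], initializer=tf.zeros_initializer())
        new_logits = tf.matmul(fc7, w) + b          # the original last layer is never evaluated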
Thanks! I recreated the graph and then loaded every single weight by getting its value from the graph definition.
This was done with graph.get_tensor_by_name('import/...'), where ... is the name of the weight.
https://www.tensorflow.org/versions/r0.9/how_tos/tool_developers/index.html
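A hedged sketch of that approach (the file name and weight tensor name are illustrative): after importing the GraphDef, each weight can be fetched by its 'import/...' name and evaluated to a numpy array, which can then seed the recreated graph:

import tensorflow as tf

graph_def = tf.GraphDef()
with tf.gfile.GFile('vgg16.tfmodel', 'rb') as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='import')

weight = graph.get_tensor_by_name('import/conv1_1/filter:0')  # hypothetical weight name
with tf.Session(graph=graph) as sess:
    value = sess.run(weight)   # numpy array, ready to initialize the recreated layer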