Why back propagation for Conv in tensorflow are separated into two operations? - tensorflow

I am trying to implement a custom convolution operation in tensorflow with c++ and cuda, and I found that the back-propagation for the Conv2D in tensorflow are implemented via two separate operations. Indeed, I found there are two operation implementations, namely conv_grad_filter_ops.cc and conv_grad_input_ops.cc in the tensorflow source code, which means the gradients for filter and input are calculated respectively. May I ask what is the idea behind this implementation? Why they were not simply merged together as one single operation?

Alright, I did a test and found that there's about 30% speed boost if the back propagation for different inputs are split into different TF ops compared with wrapped into one single TF op. This is against intuition, perhaps there's something related with TF's architecture. Note: my test was based on CUDA im2col/col2im with CuBLAS instead of CuDNN.

Related

Specify Keras GPU without using CUDA_VISIBLE_DEVICES

I have a system with two GPUs, and am using Keras with Tensorflow backend. Gpu:0 is being allocated to PyCUDA, which is performing a unique operation which is fed forward to Keras, and changes with each batch. As such, I would like to run a Keras model on gpu:1 while leaving gpu:0 allocated to PyCUDA.
Is there any way to do this? Looking through prior threads I've found several depreciated solutions.
So I don't think that this feature is meaningfully implemented in Keras currently. Found a workaround that I recommend whereby you just create multiple processes using Python's default multiprocessing library.
Note: Currently for this setup you need to spawn the new process, rather than fork it, to avoid a weird interaction with one of the PyCUDA backend libraries.

The difference between tf.layers, tf.contrib, and tf.nn in Tensorflow [duplicate]

In tensorflow 1.4, I found two functions that do batch normalization and they look same:
tf.layers.batch_normalization (link)
tf.contrib.layers.batch_norm (link)
Which function should I use? Which one is more stable?
Just to add to the list, there're several more ways to do batch-norm in tensorflow:
tf.nn.batch_normalization is a low-level op. The caller is responsible to handle mean and variance tensors themselves.
tf.nn.fused_batch_norm is another low-level op, similar to the previous one. The difference is that it's optimized for 4D input tensors, which is the usual case in convolutional neural networks. tf.nn.batch_normalization accepts tensors of any rank greater than 1.
tf.layers.batch_normalization is a high-level wrapper over the previous ops. The biggest difference is that it takes care of creating and managing the running mean and variance tensors, and calls a fast fused op when possible. Usually, this should be the default choice for you.
tf.contrib.layers.batch_norm is the early implementation of batch norm, before it's graduated to the core API (i.e., tf.layers). The use of it is not recommended because it may be dropped in the future releases.
tf.nn.batch_norm_with_global_normalization is another deprecated op. Currently, delegates the call to tf.nn.batch_normalization, but likely to be dropped in the future.
Finally, there's also Keras layer keras.layers.BatchNormalization, which in case of tensorflow backend invokes tf.nn.batch_normalization.
As show in doc, tf.contrib is a contribution module containing volatile or experimental code. When function is complete, it will be removed from this module. Now there are two, in order to be compatible with the historical version.
So, the former tf.layers.batch_normalization is recommended.

Is it possible to train pytorch and tensorflow model together on one GPU?

I have a pytorch model and a tensorflow model, I want to train them together on one GPU, following the process bellow: input --> pytorch model--> output_pytorch --> tensorflow model --> output_tensorflow --> pytorch model.
Is is possible to do this? If answer is yes, is there any problem which I will encounter?
Thanks in advance.
I haven't done this but it is possible but implementing is can be a little bit.
You can consider each network as a function, you want to - in some sense - compose these function to form your network, to do this you can compute the final function by just giving result of one network to the other and then use chain-rule to compute the derivatives(using symbolic differentiation from both packages).
I think a good way for implementing this you might be to wrap TF models as a PyTorch Function and use tf.gradients for computing the backward pass.
Doing gradient updates can really get hard (because some variables exist in TF's computation graph) you can turn TF variables to PyTorch Variable turn them into placeholdes in TF computation graph, feed them in feed_dict and update them using PyTorch mechanisms, but I think it would be really hard to do, instead if you do your updates inside backward method of the function you might be able to do the job(it is really ugly but might do the job).

skipping layer in backpropagation in keras

I am using Keras with tensorflow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. So here is what I mean
Lambda (lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution bit I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. I you really wanted to something like that, you would need to implement the op into TensorFlow yourself (no easy feat) and define the gradient that you want - because you can't have "no gradient", if anything it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used out of TensorFlow own internals.
UPDATE
Okay so a bit more of context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be for example a CPU and a GPU kernel for an op, hence the differentiation. The set of ops supported by TensorFlow is usually static, I mean it can change with newer versions, but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops then you would not be able to share your graph. Ops are then defined at C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with the name ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients is defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here, Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you would just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kinda forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure if any of this may work on Windows) that is then loaded from Python (obviously this involves going through the painful process of manual compilation of TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that register (as of one) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python it loads the library and register its gradient here through their own registration function defined here that simply calls tf.NotDifferentiable (another name for tf.NoGradient)
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
As mentioned in #jdehesa's comments. You can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning "1" would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a python function. So in place of the spiky function, substitute your function a, and in place of his derivative d_spiky replace it with
def constant(x):
return 1
So on the forward pass, a is applied in the layer and the the backwards pass 1 is applied which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.

Can Torch optim package support multiple inputs

I'm trying to use the torch7 optim package adam algorithm implementation
for optimizing a neural network which takes two independent inputs. Can this be done? The code seems to only support a single input vector. Is there some other implementation which can take a generic table of inputs?
The reference usage I saw, upon which I based my code is here
In principle, if you flatten all your parameters and gradients into two tensors then optim can handle it. See getParameters() function.