The difference between tf.layers, tf.contrib, and tf.nn in Tensorflow [duplicate]

In tensorflow 1.4, I found two functions that do batch normalization and they look the same:
tf.layers.batch_normalization (link)
tf.contrib.layers.batch_norm (link)
Which function should I use? Which one is more stable?

Just to add to the list, there're several more ways to do batch-norm in tensorflow:
tf.nn.batch_normalization is a low-level op. The caller is responsible for handling the mean and variance tensors themselves.
tf.nn.fused_batch_norm is another low-level op, similar to the previous one. The difference is that it's optimized for 4D input tensors, which is the usual case in convolutional neural networks. tf.nn.batch_normalization accepts tensors of any rank greater than 1.
tf.layers.batch_normalization is a high-level wrapper over the previous ops. The biggest difference is that it takes care of creating and managing the running mean and variance tensors, and calls a fast fused op when possible. Usually, this should be the default choice for you.
tf.contrib.layers.batch_norm is the early implementation of batch norm, before it graduated to the core API (i.e., tf.layers). Using it is not recommended because it may be dropped in future releases.
tf.nn.batch_norm_with_global_normalization is another deprecated op. Currently it delegates the call to tf.nn.batch_normalization, but it is likely to be dropped in the future.
Finally, there's also the Keras layer keras.layers.BatchNormalization, which in the case of the tensorflow backend invokes tf.nn.batch_normalization.
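For illustration, here is a minimal TF 1.x sketch of the usual tf.layers.batch_normalization pattern (the placeholder shapes, layer sizes and optimizer are arbitrary choices of mine), including the UPDATE_OPS dependency needed so that the moving statistics are updated during training:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])
labels = tf.placeholder(tf.int64, [None])
is_training = tf.placeholder(tf.bool)

h = tf.layers.conv2d(x, filters=16, kernel_size=3, padding='same')
h = tf.layers.batch_normalization(h, training=is_training)  # manages the moving mean/variance
h = tf.nn.relu(h)
logits = tf.layers.dense(tf.reshape(h, [-1, 28 * 28 * 16]), 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# The moving-average updates go into the UPDATE_OPS collection and must be
# run together with the training step:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)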

As shown in the docs, tf.contrib is a contribution module containing volatile or experimental code. Once a function is complete, it is removed from this module. The two versions currently exist side by side in order to stay compatible with historical versions.
So the former, tf.layers.batch_normalization, is recommended.

Related

Why is back propagation for Conv in tensorflow separated into two operations?

I am trying to implement a custom convolution operation in tensorflow with C++ and CUDA, and I found that the back-propagation for Conv2D in tensorflow is implemented via two separate operations. Indeed, I found there are two operation implementations, namely conv_grad_filter_ops.cc and conv_grad_input_ops.cc in the tensorflow source code, which means the gradients for the filter and the input are calculated separately. May I ask what the idea behind this implementation is? Why were they not simply merged into one single operation?
Alright, I did a test and found that there's about a 30% speed boost if the back propagation for the different inputs is split into separate TF ops compared with wrapping it into one single TF op. This is counter-intuitive; perhaps it has something to do with TF's architecture. Note: my test was based on CUDA im2col/col2im with CuBLAS instead of CuDNN.
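To make this concrete: the two halves of the Conv2D backward pass are even exposed as separate ops in the Python API. A rough sketch (shapes chosen arbitrarily):

import tensorflow as tf

x = tf.random_normal([8, 28, 28, 3])    # input batch
w = tf.random_normal([3, 3, 3, 16])     # filter
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME')
dy = tf.random_normal(tf.shape(y))      # pretend upstream gradient

# Gradient w.r.t. the input and w.r.t. the filter are two separate ops:
dx = tf.nn.conv2d_backprop_input(tf.shape(x), w, dy,
                                 strides=[1, 1, 1, 1], padding='SAME')
dw = tf.nn.conv2d_backprop_filter(x, tf.shape(w), dy,
                                  strides=[1, 1, 1, 1], padding='SAME')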

Behaviour of Alpha Dropout in Training and Inference time

I am in the process of implementing a self-normalizing neural network using tensorflow. There are currently tensorflow "primitives" in the form of tf.nn.selu and tf.contrib.nn.alpha_dropout that should make this an easy process.
My problem is with tf.contrib.nn.alpha_dropout. I was expecting it to have a boolean switch for when you are in training versus inference, as the usual dropout function used with other activation functions does.
In the original implementation by the authors, we indeed see that they have this boolean switch (training) in the selu dropout function (dropout_selu).
Is there something I am missing?
tf.contrib.nn.alpha_dropout should be seen as an analogue to tf.nn.dropout. The latter function also does not have an argument for a training switch. It is not to be confused with tf.layers.dropout, which wraps tf.nn.dropout and has a training argument. As we can see in the implementation, the layers version returns either the result of nn.dropout or the identity depending on the training switch. It should be relatively easy to define your own wrapper around alpha_dropout in a similar manner (see the sketch after the next paragraph).
To avoid any confusion: layers.dropout eventually calls the "keras layers" version of dropout which is the implementation linked above.
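Following that suggestion, a minimal sketch of such a wrapper (the function name and the rate/training convention are my own choices, mirroring tf.layers.dropout):

import tensorflow as tf

def alpha_dropout_layer(x, rate=0.1, training=False, name=None):
    # Apply alpha dropout only in training mode and return the identity
    # otherwise, just like tf.layers.dropout does for tf.nn.dropout.
    with tf.name_scope(name, 'alpha_dropout_layer', [x]):
        keep_prob = 1.0 - rate
        return tf.cond(tf.cast(training, tf.bool),
                       lambda: tf.contrib.nn.alpha_dropout(x, keep_prob),
                       lambda: tf.identity(x))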

Tensorflow - Why are there so many similar or even duplicate functions in tf.nn and tf.layers / tf.losses / tf.contrib.layers etc?

In Tensorflow (as of v1.2.1), it seems that there are (at least) two parallel APIs to construct computational graphs. There are functions in tf.nn, like conv2d, avg_pool, relu, dropout and then there are similar functions in tf.layers, tf.losses and elsewhere, like tf.layers.conv2d, tf.layers.dense, tf.layers.dropout.
Superficially, it seems that this situation only serves to confuse: for example, tf.nn.dropout uses a 'keep rate' while tf.layers.dropout uses a 'drop rate' as an argument.
Does this distinction have any practical purpose for the end-user / developer?
If not, is there any plan to cleanup the API?
Tensorflow proposes on the one hand a low-level API (tf.*, tf.nn.*, ...), and on the other hand a higher-level API (tf.layers.*, tf.losses.*, ...).
The goal of the higher-level API is to provide functions that greatly simplify the design of the most common neural nets. The lower-level API is there for people with special needs, or who wish to keep finer control of what is going on.
The situation is a bit confusing though, because some functions have the same or similar names, and also, there is no clear way to tell at first sight which namespace corresponds to which level of the API.
Now, let's look at conv2d for example. A striking difference between tf.nn.conv2d and tf.layers.conv2d is that the latter takes care of all the variables needed for weights and biases. A single line of code, and voilà, you just created a convolutional layer. With tf.nn.conv2d, you have to declare the weights variable yourself before passing it to the function. And as for the biases, they are actually not even handled: you need to add them yourself later.
Add to that that tf.layers.conv2d also lets you add regularization and activation in the same function call, and you can imagine how this can reduce code size when one's needs are covered by the higher-level API.
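A rough side-by-side sketch of what this looks like (shapes, initializers and hyper-parameters are arbitrary):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 28, 28, 1])

# Low-level tf.nn: you create and manage weights and biases yourself.
w = tf.get_variable('w', [3, 3, 1, 16],
                    initializer=tf.truncated_normal_initializer(stddev=0.1))
b = tf.get_variable('b', [16], initializer=tf.zeros_initializer())
y_nn = tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME') + b)

# High-level tf.layers: variables, bias, activation and regularization in one call.
y_layers = tf.layers.conv2d(
    x, filters=16, kernel_size=3, padding='same',
    activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l2_regularizer(1e-4))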
The higher level also makes some decisions by default that could be considered best practices. For example, losses in tf.losses are added to the tf.GraphKeys.LOSSES collection by default, which makes recovering and summing the various components easy and somewhat standardized. If you use the lower-level API, you would need to do all of that yourself. Obviously, you would need to be careful when you start mixing low- and high-level API functions there.
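A small sketch of that collection-based bookkeeping (placeholders arbitrary):

import tensorflow as tf

labels = tf.placeholder(tf.int64, [None])
logits = tf.placeholder(tf.float32, [None, 10])

# tf.losses.* functions add their results to tf.GraphKeys.LOSSES, so the
# total (optionally including regularization losses) is one call away:
tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
total_loss = tf.losses.get_total_loss(add_regularization_losses=True)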
The higher-level API is also an answer to a great need from people who are used to similarly high-level functions in other frameworks, Theano aside. This is rather obvious when one considers the number of alternative higher-level APIs built on top of tensorflow, such as keras 2 (now part of the official tensorflow API), slim (in tf.contrib.slim), tflearn, tensorlayer, and the like.
Finally, if I may add a piece of advice: if you are beginning with tensorflow and do not have a preference towards a particular API, I would personally encourage you to stick to the tf.keras.* API:
Its API is friendly and at least as good as the other high-level APIs built on top of the low-level tensorflow API
It has a clear namespace within tensorflow (although it can -- and sometimes should -- be used with parts from other namespaces, such as tf.data)
It is now a first-class citizen of tensorflow (it used to be in tf.contrib.keras), and care is taken to make new tensorflow features (such as eager) compatible with keras.
Its generic implementation can use other toolkits such as CNTK, and so does not lock you into tensorflow.
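For completeness, a minimal sketch of a small model written against the recommended tf.keras API (architecture chosen arbitrarily):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding='same', activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),   # a drop rate, applied only during training
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])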

skipping layer in backpropagation in keras

I am using Keras with the tensorflow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. Here is what I mean:
Lambda (lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution but I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. If you really wanted to do something like that, you would need to implement the op in TensorFlow yourself (no easy feat) and define the gradient that you want, because you can't have "no gradient"; if anything it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used outside of TensorFlow's own internals.
UPDATE
Okay, so a bit more context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be, for example, a CPU and a GPU kernel for an op, hence the differentiation. The set of ops supported by TensorFlow is usually static (it can change with newer versions), but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops you would not be able to share your graph. Ops are then defined at the C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at the C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with names ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients are defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here; Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you would just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kinda forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure if any of this may work on Windows) that is then loaded from Python (obviously this involves going through the painful process of manual compilation of TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that registers (as of now) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python it loads the library and registers its gradient here through their own registration function defined here that simply calls tf.NotDifferentiable (another name for tf.NoGradient).
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
As mentioned in @jdehesa's comments, you can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning "1" would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a python function. So in place of the spiky function, substitute your function a, and in place of his derivative d_spiky, use
def constant(x):
    return 1
So on the forward pass, a is applied in the layer, and on the backward pass 1 is applied, which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.
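Putting the two answers together, here is a minimal sketch along the lines of the linked example, using tf.py_func plus gradient_override_map. The forward function a and all names are placeholders of my own choosing, and the registered gradient simply passes the incoming gradient through unchanged (i.e., it acts as the constant-1 derivative described above):

import numpy as np
import tensorflow as tf

def a(x):
    # Some non-differentiable forward computation, as a stand-in example.
    return np.sign(x).astype(np.float32)

@tf.RegisterGradient('PassThroughGrad')
def _pass_through_grad(op, grad):
    # Pretend the local derivative is 1: just forward the upstream gradient.
    return grad

x = tf.placeholder(tf.float32, [None, 10])
g = tf.get_default_graph()
with g.gradient_override_map({'PyFunc': 'PassThroughGrad'}):
    y = tf.py_func(a, [x], tf.float32, name='a_forward')
    y.set_shape(x.get_shape())  # py_func loses the static shape information

# y can now be wrapped in a Keras Lambda layer; gradients flow through it as identity.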

Why does TensorFlow have a lot of mathematical equations re-implemented?

I was looking through the API in TensorFlow and noticed that a lot of mathematical operations that already exist in Python and numpy have been re-implemented (or at least given a tensorflow interface), for example tf.add and tf.matmul.
Is there a good reason to do this?
I've been searching their documentation but can't find why they'd do this.
I do have some guesses though. One of my main guesses is that they probably want those operations to have some backpropagation effect on whatever neural network graph gets implemented, in other words, to have their derivatives implemented. Is this one of the reasons? (I wish I knew how to even check whether my guess is right.)
For example, in one of the most basic examples of linear regression, one defines the prediction function that one wants to implement:
product = tf.matmul(x,W)
y = product + b
instead of
product = tf.matmul(x,W)
y = tf.add(product, b)
Somehow the first implementation does not interfere with the Stochastic Gradient Descent algorithm for training, so it probably doesn't matter whether one uses the plain + or tf.add to train? This is one aspect that confuses me: how do I know which one I should be using?
Or maybe there are performance reasons? Or maybe it's to give those operations access to the GPU if one is required to use GPUs?
You have to understand that with these operations you create a tensorflow graph, meaning they aren't the same as the numpy functions; they are more an abstraction of them.
Maybe you have noticed that you have to create a session and then evaluate the functions through that session to get a result, whereas numpy functions are executed directly. This is because the graph and its functions define what to do, like writing down a formula, but to get results for a specific x (or whatever) you have to insert a value for x. This is what you're doing through the session and eval.
So to conclude: with tensorflow you define a graph, which is a more abstract representation of the functions. The graph isn't executed when it is defined; it is executed when you call eval and through that run the session.
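A small sketch of that difference (values arbitrary):

import numpy as np
import tensorflow as tf

a = np.array([1.0, 2.0], dtype=np.float32)
b = np.array([3.0, 4.0], dtype=np.float32)

# numpy executes immediately:
print(a + b)  # [4. 6.]

# tensorflow only builds graph nodes here; nothing is computed yet:
x = tf.placeholder(tf.float32, [2])
y = tf.placeholder(tf.float32, [2])
z = tf.add(x, y)

# The computation happens when the graph is run in a session:
with tf.Session() as sess:
    print(sess.run(z, feed_dict={x: a, y: b}))  # [4. 6.]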
Also notice that you can't mix numpy functions and tensorflow functions directly, but you can define your own tensorflow functions (https://www.tensorflow.org/versions/r0.9/how_tos/adding_an_op/index.html).
Btw I guess most of the tensorflow functions are using numpy under the hood. :)