Which operations support automatic gradients? - tensorflow

I have a fairly complex quantisation layer in my network which contains, among other operations, tf.tile and tf.expand_dims ops. I noticed that my network did not train well. Looking at some debug output, I saw that the fully connected layer before this quantisation layer got zero gradients for its weights (I used optimizer.compute_gradients to determine this). Does this mean that what ever is before the quantisation layer does not update in training?
In general: How do I figure out which operations let gradients pass through and which do not? For instance, do the above mentionied tf.tile and tf.expand_dims let gradients pass through`

If there is an operation without gradients in your model you will get an error:
LookupError: No gradient defined for operation [...]
So your problem seems to be somewhere else, maybe you have a multiplication by zero somewhere which kills the gradients. There is not enough information in your question to find the real reason of your problem.
Edit:
I didn't directly answer the question which operations support automatic gradients.
It is not listed in the documentation and I think you can only see it by checking the source code or using the operation and see if you get the mentioned error when you try to optimize the model.
For tf.tile and tf.expand_dims there are gradients defined.

Related

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I try to implement an RNN (with 2 GRU layers).
At first, the network seems to do it's job quite fine. However, I am currently trying to understand the loss and accuracy curve. I attached the pictures below. The dark-blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you maybe got an idea?
One idea I had in mind was that this could be caused by some big outliers in my dataset. Thus I already tried to clean it up. I also tried to scale it properly. I also added a few dropout layers for further regularization (rate=0.2). However these are just normal dropout layers because cudnn does not seem to support recurrent_dropout from tensorflow.
Remark: I am using the negative log-likelihood as loss function and a tensorflow probability distribution as the output dense layer.
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. Seems like here the mean-absolute-error behaves normal (does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but the MAE as a metric, which is suited for deterministic continuous values.
I don't know what is your task and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes out from there.

Behaviour of Alpha Dropout in Training and Inference time

I am in the process of implementing a self normalizing neural network using the tensorflow. There are currently tensorflow "primitives" in the form of tf.nn.selu and tf.contrib.nn.alpha_dropout that should make this an easy process.
My problem is with tf.contrib.nn.alpha_dropout. I was expecting it to have a boolean switch for when you are in training and when you are in inference as does the usual dropout function used with other activation functions.
In the original implementation by the authors, we indeed see that they have this boolean switch (training) in the selu dropout function (dropout_selu).
Is there something I am missing?
tf.contrib.nn.alpha_dropout should be seen as an analogue to tf.nn.dropout. The latter function also does not have an argument for a training switch. It is not to be confused with tf.layers.dropout, which wraps tf.nn.dropout and has a training argument. As we can see in the implementation, the layers version returns either the result of nn.dropout or the identity depending on the training switch. It should be relatively easy to define your own wrapper around alpha_dropout in a similar manner.
To avoid any confusion: layers.dropout eventually calls the "keras layers" version of dropout which is the implementation linked above.

Does Tensorflow support Keras models fit() method with eager execution?

I am training a Keras model (tf.keras.models.Sequential) calling its method fit().
Since I enabled eager execution, training time (for the same number of epochs) went up from 20.1s to 49.4s. Also, training didn't seem to converge anymore, as loss remained around 9 (without eager execution it went down to 1), while method fit() didn't even report the requested metric "accuracy" anymore.
Is eager execution support for Keras models? Note that I am calling method fit() on the model, not using an estimator.
Here the snippet of code that declares the model and does the training. Using TF 1.7 for GPU installed with pip3.
tf.enable_eager_execution()
model = tf.keras.models.Sequential([
tf.keras.layers.InputLayer(input_shape=(11,)) ,
tf.keras.layers.Dense(64, activation='relu') ,
tf.keras.layers.Dense(32, activation='relu'),
tf.keras.layers.Dense(11, activation='softmax')
])
optimizer = tf.train.AdamOptimizer()
# optimizer = 'adam'
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x=train_X, y=train_y, epochs=200, batch_size=64, verbose=2)
UPDATE: filed issue #18642 on Tensorflow GITHUB.
The issue I reported on tensorflow got this answer:
Thank you for the bug report. We have a fix for this issue, that will
show up on GitHub soon.
See issue #18642 on GITHUB for Tensorflow.
Based on this, I understand that method fit() of Keras models will be supported with eager execution, once the bug is fixed.
Here is a quote from the Tensorflow site found here
There are many parameters to optimize when calculating derivatives. TensorFlow code is easier to read when structured into reusable classes and objects instead of a single top-level function. Eager execution encourages the use of the Keras-style layer classes in the tf.keras.layers module. Additionally, the tf.train.Optimizer classes provide sophisticated techniques to calculate parameter updates.
That means keras layers and subsequent models are allowed using Eager execution.
As for your timing, the link also mentions how using eager stops building of graphs.
TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without an extra graph-building step. Operations return concrete values instead of constructing a computational graph to run later.
This may make it harder for your model to run given the number of DENSE layers you have. Someone may correct me on that because I have not done much work with DENSE layers before, or it has been a long time since I have. If that does not work then I would look into your loss function. This answer may help if that becomes a problem.
Everything else looks alright though. Hope this helps.
EDIT
Ok I see what you are saying Fate. Yeah the first link uses Sequential model, but Gradient tape fro gradient decent. Reading deeper into the eager tutorial shows that they only use Gradient tape as well. Here is what the tutorial says about training:
Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. During eager execution, use tfe.GradientTape to trace operations for computing gradients later.tfe.GradientTape is an opt-in feature to provide maximal performance when not tracing. Since different operations can occur during each call, all forward-pass operations get recorded to a "tape". To compute the gradient, play the tape backwards and then discard. A particular tfe.GradientTape can only be computed once, subsequent calls throw a runtime error.
So maybe as of right now only Gradient tape and the estimator method are what you are supposed to use with eager.
When reading the compile method on Model (documentation), you can find an argument, run_eagerly:
run_eagerly: Bool. Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
So by default, a tf.keras.Model will default to running through graph execution, not eager execution.

Tensorflow. Conditionally trainable variables and stochastic depth neural networks

I have come across with a problem when started implementing stochastic depth regularization approch using Tensorflow. The paper (https://arxiv.org/pdf/1603.09382.pdf) states that the model can converge faster if we drop randomly some residual units during training. Current Torch implementation works perfectly. In Tensoflow, I can put conditions on residual unit branches, so that during forward step the activations for it will be cancelled, but the weights still will be updated during backward step. There is no way to tell that these weights (in the residual branch which we cancelled) are no longer trainable and they should not be included in optimization for current session run.
I have created the issue on github, where I covered how this problem can be solved in naive way, of course there is something under-hood that will prevent applying an easy fix, otherwise it is really strange why the tf.Variable's trainable parameter does not allow boolean Tensor as a value. If someone has clue for this question, I would really appreciate if you restore my faith in Tensoflow :)
The trainable parameter is used to control whether a graph to train that variable is built or not. Using a conditional stopgradient (a tf.cond with tf.identity in one branch and tf.stopgradient on the other) will deal with stopping the gradient from that variable.
However, if its value was not used during a forward step the gradient computed is guaranteed to be 0, and hence the update will be a no-op.

skipping layer in backpropagation in keras

I am using Keras with tensorflow backend and I am curious whether it is possible to skip a layer during backpropagation but have it execute in the forward pass. So here is what I mean
Lambda (lambda x: a(x))
I want to apply a to x in the forward pass but I do not want a to be included in the derivation when the backprop takes place.
I was trying to find a solution bit I could not find anything. Can somebody help me out here?
UPDATE 2
In addition to tf.py_func, there is now an official guide on how to add a custom op.
UPDATE
See this question for an example of writing a custom op with gradient purely in Python without needing to rebuild anything. Note that there are some limitations to the method (see the documentation of tf.py_func).
Not exactly a solution to the problem, but still kind of an answer and too long for comments.
That's not even a Keras issue, but a TensorFlow one. Each op defines its own gradient computation that is used during backpropagation. I you really wanted to something like that, you would need to implement the op into TensorFlow yourself (no easy feat) and define the gradient that you want - because you can't have "no gradient", if anything it would be 1 or 0 (otherwise you can't go on with backpropagation). There is a tf.NoGradient function in TensorFlow which causes an op to propagate zeros, but I don't think it is meant to / can be used out of TensorFlow own internals.
UPDATE
Okay so a bit more of context. TensorFlow graphs are built of ops, which are implemented by kernels; this is basically a 1-to-1 mapping, except that there may be for example a CPU and a GPU kernel for an op, hence the differentiation. The set of ops supported by TensorFlow is usually static, I mean it can change with newer versions, but in principle you cannot add your own ops, because the ops of a graph go into the Protobuf serialized format, so if you made your own ops then you would not be able to share your graph. Ops are then defined at C++ level with the macro REGISTER_OP (see for example here), and kernels with REGISTER_KERNEL_BUILDER (see for example here).
Now, where do gradients come into play? Well, the funny thing is that the gradient of an op is not defined at C++ level; there are ops (and kernels) that implement the gradient of other ops (if you look at the previous files you'll find ops/kernels with the name ending in Grad), but (as far as I'm aware) these are not explicitly "linked" at this level. It seems that the associations between ops and their gradients is defined in Python, usually via tf.RegisterGradient or the aforementioned tf.NoGradient (see for example here, Python modules starting with gen_ are autogenerated with the help of the C++ macros); these registrations inform the backpropagation algorithm about how to compute the gradient of the graph.
So, how to actually work this out? Well, you need to create at least one op in C++ with the corresponding kernel/s implementing the computation that you want for your forward pass. Then, if the gradient computation that you want to use can be expressed with existing TensorFlow ops (which is most likely), you would just need to call tf.RegisterGradient in Python and do the computation there in "standard" TensorFlow. This is quite complicated, but the good news is it's possible, and there's even an example for it (although I think they kinda forgot the gradient registration part in that one)! As you will see, the process involves compiling the new op code into a library (btw I'm not sure if any of this may work on Windows) that is then loaded from Python (obviously this involves going through the painful process of manual compilation of TensorFlow with Bazel). A possibly more realistic example can be found in TensorFlow Fold, an extension of TensorFlow for structured data that register (as of one) one custom operation here through a macro defined here that calls REGISTER_OP, and then in Python it loads the library and register its gradient here through their own registration function defined here that simply calls tf.NotDifferentiable (another name for tf.NoGradient)
tldr: It is rather hard, but it can be done and there are even a couple of examples out there.
As mentioned in #jdehesa's comments. You can implement your function with an "alternative gradient". Forgive me if my math is not correct, but I think a derivative returning "1" would be the correct way to have no effect on the backpropagation while still passing the learning through. For how to construct it, see here. The example I cited goes further and allows you to construct an activation function from a python function. So in place of the spiky function, substitute your function a, and in place of his derivative d_spiky replace it with
def constant(x):
return 1
So on the forward pass, a is applied in the layer and the the backwards pass 1 is applied which should simply pass the weight adjustments through.
You can then just create an Activation layer in Keras using this function.