Why is L2 regularization not added back into original loss function? - tensorflow

I'm aware that when using a kernel regularizer, particularly the l2 loss, I should add it back into the loss function, and this is what is being done in other posts. However, in Keras, they are not following this process. Why is this so?
For instance, consider this and this notebook. They are using the l2 loss as a kernel regularizer in some layers but not adding it back into the original loss. Is this because of the particular loss, is this a behavior followed only in Keras, or am I completely misunderstanding everything?

Keras hides a lot of complexity (and this is not always a good thing).
You're using the Model abstraction: this model contains inside all the required information about the architecture and the training procedure.
When you invoke compile (and later fit or train_on_batch), you specify the loss function, but under the hood what happens is:
Instantiate the loss function specified (e.g. categorical cross entropy)
Fetch from the model the regularizations applied and add all of them to the loss term previously instantiated
You can see the operations that are going to be added to the loss term through the .losses property of the model instance. That is a list of tensorflow operations, usually all multiplication operations, since the regularization terms have the form regularization_strength * norm_p(variable).
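The bookkeeping described above can be inspected directly. The following is a minimal sketch, assuming TensorFlow 2.x with tf.keras; the layer sizes, the 0.01 regularization strength, the random data and the mean-squared-error data loss are illustrative choices, not taken from the notebooks in the question.

```python
import tensorflow as tf

# A layer with an L2 kernel regularizer (strength 0.01 is an arbitrary choice).
regularized = tf.keras.layers.Dense(
    3, kernel_regularizer=tf.keras.regularizers.L2(0.01)
)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    regularized,
    tf.keras.layers.Dense(1),  # no regularizer, so it contributes nothing
])

# model.losses holds one penalty tensor per regularized weight.
reg_losses = model.losses
print(len(reg_losses))

# The penalty is strength * sum(kernel ** 2), recomputed here by hand.
kernel_values = regularized.kernel.numpy()
manual_penalty = 0.01 * float((kernel_values ** 2).sum())
print(float(reg_losses[0]), manual_penalty)

# In a hand-written training loop you would add the penalties yourself;
# fit() does the equivalent of this for you.
x = tf.random.normal((8, 4))
y = tf.random.normal((8, 1))
data_loss = tf.reduce_mean(tf.square(model(x) - y))
total_loss = data_loss + tf.add_n(model.losses)
```

The two printed penalty values should agree up to float32 rounding, which is a quick way to confirm that the regularizer is part of what gets minimized.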

The L2 regularization (or any weight regularization) in Keras is still added to the loss function in the same way as you would expect. It just happens behind the scenes, so the user doesn't need to worry about it.
The notebooks you linked are the right way to use weight regularization in Keras.

Related

Using multiple losses and multiple training steps in a TF2 model using subclassing?

I am implementing a generative adversarial autoencoder in TF2. I have got it working but not optimally and could use some high-level advice to improve it.
The model is inspired by the paper “Adversarial Factorization Autoencoder for Look-alike Modeling” https://dmkd.cs.vt.edu/papers/CIKM19.pdf
The model consists of three parts: an encoder/generator, a decoder and a discriminator.
I have implemented these three parts as custom classes, each subclassing from tf.keras.Model (which is the primary source of my issues). I have three different loss functions (autoencoder loss, generator loss, discriminator loss) and two custom training step functions.
The first training function trains first the autoencoder (using the autoencoder loss function) and then the generator (using the generator loss function). The generator is just the encoder part of the autoencoder, but with the double purpose of also fooling the discriminator.
The second training function trains the discriminator using the discriminator loss function.
This approach works ok but subclassing all three parts from tf.keras.Model has limitations. I can’t utilize keras compile and fit functionalities. Callbacks are a nightmare and I really do need early stopping, keep best model, tensorboard integration and so on. It appears the best approach is to subclass each part from tf.keras.layers.Layer and then combine them in a single custom model. But I am not sure if it is at all possible to wire up multiple loss functions and multiple training steps to different layer blocks of a custom model?
Any hints and insights are greatly appreciated.

Behaviour of Alpha Dropout in Training and Inference time

I am in the process of implementing a self-normalizing neural network using tensorflow. There are currently tensorflow "primitives" in the form of tf.nn.selu and tf.contrib.nn.alpha_dropout that should make this an easy process.
My problem is with tf.contrib.nn.alpha_dropout. I was expecting it to have a boolean switch for when you are in training and when you are in inference as does the usual dropout function used with other activation functions.
In the original implementation by the authors, we indeed see that they have this boolean switch (training) in the selu dropout function (dropout_selu).
Is there something I am missing?
tf.contrib.nn.alpha_dropout should be seen as an analogue to tf.nn.dropout. The latter function also does not have an argument for a training switch. It is not to be confused with tf.layers.dropout, which wraps tf.nn.dropout and has a training argument. As we can see in the implementation, the layers version returns either the result of nn.dropout or the identity depending on the training switch. It should be relatively easy to define your own wrapper around alpha_dropout in a similar manner.
To avoid any confusion: layers.dropout eventually calls the "keras layers" version of dropout which is the implementation linked above.
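A wrapper of the kind suggested above could look like the sketch below. It is plain Python and assumes the training flag is an ordinary Python bool; a tensor-valued flag would need a graph-level conditional instead. The names dropout_with_switch and toy_dropout are hypothetical: toy_dropout is an ordinary inverted-dropout stand-in so the example runs without TensorFlow, and in real code you would pass alpha_dropout in its place.

```python
import random


def dropout_with_switch(values, dropout_fn, keep_prob, training=False):
    """Apply dropout_fn while training; behave as the identity at inference."""
    if training:
        return dropout_fn(values, keep_prob)
    return values


def toy_dropout(values, keep_prob, seed=0):
    """Stand-in for a switch-less dropout function (NOT alpha dropout).

    Keeps each element with probability keep_prob and rescales the kept ones.
    """
    rng = random.Random(seed)
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


activations = [0.5, -1.0, 2.0, 0.25]
at_inference = dropout_with_switch(activations, toy_dropout, 0.5, training=False)
at_training = dropout_with_switch(activations, toy_dropout, 0.5, training=True)
print(at_inference)
print(at_training)
```

At inference the input comes back unchanged, while in training mode each element is either dropped or rescaled by the stand-in function.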

Does Tensorflow support Keras models fit() method with eager execution?

I am training a Keras model (tf.keras.models.Sequential) calling its method fit().
Since I enabled eager execution, training time (for the same number of epochs) went up from 20.1s to 49.4s. Also, training didn't seem to converge anymore, as loss remained around 9 (without eager execution it went down to 1), while method fit() didn't even report the requested metric "accuracy" anymore.
Is eager execution supported for Keras models? Note that I am calling method fit() on the model, not using an estimator.
Here is the snippet of code that declares the model and does the training. Using TF 1.7 for GPU installed with pip3.
import tensorflow as tf

tf.enable_eager_execution()
model = tf.keras.models.Sequential([
    tf.keras.layers.InputLayer(input_shape=(11,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(11, activation='softmax'),
])
optimizer = tf.train.AdamOptimizer()
# optimizer = 'adam'
model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
# train_X and train_y are the training inputs and one-hot labels, defined elsewhere
model.fit(x=train_X, y=train_y, epochs=200, batch_size=64, verbose=2)
UPDATE: filed issue #18642 on Tensorflow GITHUB.
The issue I reported on tensorflow got this answer:
Thank you for the bug report. We have a fix for this issue, that will
show up on GitHub soon.
See issue #18642 on GITHUB for Tensorflow.
Based on this, I understand that method fit() of Keras models will be supported with eager execution, once the bug is fixed.
Here is a quote from the Tensorflow site found here
There are many parameters to optimize when calculating derivatives. TensorFlow code is easier to read when structured into reusable classes and objects instead of a single top-level function. Eager execution encourages the use of the Keras-style layer classes in the tf.keras.layers module. Additionally, the tf.train.Optimizer classes provide sophisticated techniques to calculate parameter updates.
That means keras layers and subsequent models are allowed using Eager execution.
As for your timing, the link also mentions how using eager stops building of graphs.
TensorFlow's eager execution is an imperative programming environment that evaluates operations immediately, without an extra graph-building step. Operations return concrete values instead of constructing a computational graph to run later.
This may make it harder for your model to run given the number of DENSE layers you have. Someone may correct me on that because I have not done much work with DENSE layers before, or it has been a long time since I have. If that does not work then I would look into your loss function. This answer may help if that becomes a problem.
Everything else looks alright though. Hope this helps.
EDIT
Ok, I see what you are saying, Fate. Yeah, the first link uses a Sequential model, but with gradient tape for gradient descent. Reading deeper into the eager tutorial shows that they only use gradient tape as well. Here is what the tutorial says about training:
Automatic differentiation is useful for implementing machine learning algorithms such as backpropagation for training neural networks. During eager execution, use tfe.GradientTape to trace operations for computing gradients later. tfe.GradientTape is an opt-in feature to provide maximal performance when not tracing. Since different operations can occur during each call, all forward-pass operations get recorded to a "tape". To compute the gradient, play the tape backwards and then discard. A particular tfe.GradientTape can only be computed once, subsequent calls throw a runtime error.
So maybe as of right now only Gradient tape and the estimator method are what you are supposed to use with eager.
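For reference, a gradient-tape training step along the lines the tutorial describes might look like the following sketch. It assumes TensorFlow 2.x, where the tape is available as tf.GradientTape and eager execution is on by default, rather than the tfe module of TF 1.7 used in the question. The smaller network, the random batch and the train_step helper are illustrative assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(11,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(11, activation="softmax"),
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.CategoricalCrossentropy()


def train_step(x, y):
    """Run one forward pass under the tape, then apply the gradients."""
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        loss = loss_fn(y, predictions)
    # Play the tape backwards to get d(loss)/d(weight) for every weight.
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss


# Random stand-in batch: 8 samples, 11 features, 11 one-hot classes.
x = tf.random.normal((8, 11))
labels = tf.random.uniform((8,), maxval=11, dtype=tf.int32)
y = tf.one_hot(labels, depth=11)
loss = train_step(x, y)
print(float(loss))
```

Each call to train_step creates a fresh tape, which matches the tutorial's note that a tape is played backwards once and then discarded.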
When reading the compile method on Model (documentation), you can find an argument, run_eagerly:
run_eagerly: Bool. Defaults to False. If True, this Model's logic will not be wrapped in a tf.function. Recommended to leave this as None unless your Model cannot be run inside a tf.function.
So by default, a tf.keras.Model will default to running through graph execution, not eager execution.

backpropagation issues with a custom layer (TF/Keras)

I've been working on a prototype and I am having issues with backpropagation. I am currently using the latest keras and tensorflow build (with tensorflow as a backend; I have looked into cntk, mxnet, and chainer, and so far only chainer would allow me to do it, but the training time is quite slow).
My current layer is similar to a convolutional layer with more operations than a simple multiplication.
I know that tensorflow should use automatic differentiation if all the operations support it to calculate the gradient and perform gradient descent.
Currently my layer uses the following operators: reduce_sum, sum, subtraction, multiplication and division.
It also relies on the following methods: extract_image_patches, reshape, transpose.
I doubt any of these would cause an issue with automatic differentiation. I built 2 layers as tests: one inherits from the base layer in keras while the other inherits directly from _Conv. In both cases, whenever I use that layer anywhere in a model, no weights are updated during the training process.
How could I solve this problem and fix backpropagation?
Edit:
Here is the layer implementation: https://github.com/roya0045/cvar2/blob/master/tfvar.py
For the testing itself see https://github.com/roya0045/cvar2/blob/master/test2.py

Tensorflow. Conditionally trainable variables and stochastic depth neural networks

I came across a problem when I started implementing the stochastic depth regularization approach using Tensorflow. The paper (https://arxiv.org/pdf/1603.09382.pdf) states that the model can converge faster if we randomly drop some residual units during training. The current Torch implementation works perfectly. In Tensorflow, I can put conditions on residual unit branches, so that during the forward step the activations for a branch are cancelled, but the weights will still be updated during the backward step. There is no way to tell that these weights (in the residual branch which we cancelled) are no longer trainable and should not be included in optimization for the current session run.
I have created an issue on github, where I covered how this problem can be solved in a naive way. Of course there is something under the hood that will prevent applying an easy fix, otherwise it is really strange that the tf.Variable's trainable parameter does not allow a boolean Tensor as a value. If someone has a clue for this question, I would really appreciate it if you restored my faith in Tensorflow :)
The trainable parameter is used to control whether a graph to train that variable is built or not. Using a conditional stop gradient (a tf.cond with tf.identity in one branch and tf.stop_gradient in the other) will deal with stopping the gradient from that variable.
However, if its value was not used during a forward step the gradient computed is guaranteed to be 0, and hence the update will be a no-op.
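The conditional stop gradient mentioned above could be sketched as follows. This assumes TensorFlow 2.x with eager execution and tf.GradientTape, which is not the session-based setup in the question. The helper names maybe_stop_gradient and grad_wrt_w are hypothetical, as are the scalar values.

```python
import tensorflow as tf


def maybe_stop_gradient(value, trainable):
    """Pass value through unchanged, blocking its gradient unless trainable."""
    return tf.cond(
        trainable,
        lambda: tf.identity(value),
        lambda: tf.stop_gradient(value),
    )


w = tf.Variable(2.0)   # stands in for a weight in a residual branch
x = tf.constant(1.5)


def grad_wrt_w(trainable):
    """Return d(w * x)/dw with the gradient path switched on or off."""
    with tf.GradientTape() as tape:
        y = maybe_stop_gradient(w, tf.constant(trainable)) * x
    return tape.gradient(y, w)


grad_on = grad_wrt_w(True)    # gradient flows: equals x
grad_off = grad_wrt_w(False)  # gradient blocked: the tape returns None
print(grad_on, grad_off)
```

The forward value is the same in both cases; only the backward path differs, so a blocked branch contributes no update to w for that step.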