What is _uses_learning_phase in Keras? - tensorflow

I'm trying to write my own recurrent layer in Keras and noticed this line in the Keras source:
    # Properly set learning phase on output tensor.
    if 0 < self.dropout + self.recurrent_dropout:
        if training is None:
            output._uses_learning_phase = True
Checking the backend code for in_train_phase:
    if training is None:
        training = learning_phase()
        uses_learning_phase = True
    else:
        uses_learning_phase = False
This is rather confusing. Isn't "training" the "learning phase"?! I guess more importantly, do I need to set _uses_learning_phase on output in my custom recurrent layer?

Intro
A "Training Flag" lets a Model (or Layer) behave differently during training than when it predicts results or is being tested.
Depending on the backend used, Keras may need to implement its own boolean "training flag" (on CNTK, as of Keras 2.2.4) or can use a native backend tensor (as with TensorFlow). Code that handles both cases was therefore integrated.
As a consequence, the Layer class has a property described as follows:
uses_learning_phase: Whether any operation
of the layer uses `K.in_training_phase()`
or `K.in_test_phase()`.
Output tensors may also be given an attribute _uses_learning_phase, which is read by the property. If any output tensor has the attribute (and it is True), the layer's property returns True.
Usage in Keras's Recurrent layer
Your code snippet comes from keras/layers/recurrent.py. When the private _generate_dropout_mask method is called, it calls the backend's operation creator in_train_phase(). That is why the output tensor's _uses_learning_phase flag is set.
Explanation of quoted backend code
in_train_phase() and in_test_phase() work the same way. "training" is an optional argument that references the Training Flag. If the argument is not given, the Training Flag is looked up automatically at
    training = learning_phase()
However, the output tensor's attribute _uses_learning_phase is only set (to True) if the Training Flag is a backend tensor AND the optional training argument was not given. (This may also explain why a layer needs to set _uses_learning_phase itself, but I see no use case for creating an operation via in_test_phase without flagging the output tensor. For now, assume there is one.)
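To make the branching concrete, here is a framework-free sketch of the logic described above. It is not the Keras implementation: SymbolicFlag and Switch are hypothetical stand-ins for a backend placeholder and for the tensor a backend switch operation would return.

```python
class SymbolicFlag:
    """Hypothetical stand-in for a backend placeholder (value known only at run time)."""


class Switch:
    """Hypothetical stand-in for the tensor returned by a backend switch(cond, x, alt)."""

    def __init__(self, condition, x, alt):
        self.condition, self.x, self.alt = condition, x, alt
        self._uses_learning_phase = False


_LEARNING_PHASE = SymbolicFlag()  # assumed global flag, like a placeholder tensor


def learning_phase():
    return _LEARNING_PHASE


def in_train_phase(x, alt, training=None):
    # No explicit argument: fall back to the global flag and remember that we did.
    if training is None:
        training = learning_phase()
        uses_learning_phase = True
    else:
        uses_learning_phase = False

    # A plain Python value lets us pick the branch right away; nothing is flagged.
    if isinstance(training, (bool, int)):
        return x if training else alt

    # Otherwise the choice is deferred to run time via a switch operation.
    out = Switch(training, x, alt)
    if uses_learning_phase:
        out._uses_learning_phase = True
    return out


implicit = in_train_phase("with dropout", "without dropout")
explicit = in_train_phase("with dropout", "without dropout", training=SymbolicFlag())
print(implicit._uses_learning_phase)  # True
print(explicit._uses_learning_phase)  # False
```

In this sketch the flag is only set in the first call, where the global flag was used implicitly. The second call corresponds to the case where a layer that receives its own symbolic training argument would have to flag its output itself.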

Related

Clarification on Tensorflow 2.0 Masking

From the Tensorflow documentation when using Keras subclassing API, they give this example on how to pass a mask along to other layers that implement masking. I am wondering if this is explicitly required or if it is handled correctly after the Embedding layer has mask_zero=True.
    import numpy as np
    from tensorflow.keras import layers
    class MyLayer(layers.Layer):
        def __init__(self, **kwargs):
            super(MyLayer, self).__init__(**kwargs)
            self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
            self.lstm = layers.LSTM(32)
        def call(self, inputs):
            x = self.embedding(inputs)
            # Note that you could also prepare a `mask` tensor manually.
            # It only needs to be a boolean tensor
            # with the right shape, i.e. (batch_size, timesteps).
            mask = self.embedding.compute_mask(inputs)
            output = self.lstm(x, mask=mask)  # The layer will ignore the masked values
            return output
    layer = MyLayer()
    x = np.random.random((32, 10)) * 100
    x = x.astype('int32')
    layer(x)
My confusion comes from another area of the documentation which states:
Masking
This layer supports masking for input data with a variable number of
timesteps. To introduce masks to your data, use an Embedding layer
with the mask_zero parameter set to True.
This seems to mean that if mask_zero=True, nothing more needs to be done in subsequent layers.
The documentation of the Masking layer also supports this: once you introduce the mask at the beginning, all the remaining layers get the mask automatically.
Quote:
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
If any downstream layer does not support masking yet receives such an input mask, an exception will be raised.
This other link also states the same. The mask will be propagated to all layers.
Quote:
When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
The second link is really full of details on masking.
Notice that the code you showed is for a custom embedding. It teaches you how to "create and pass" a mask, in case you want to write a layer that creates a mask. It's basically showing what the normal Embedding layer does.
So, we can conclude that if you're using a normal Embedding layer, all you need is mask_zero=True and everything will go down the stream.
In addition to the high-level answer given, let's have a look at some important technical details.
If in doubt, inspect the masking source code to understand how it works.
Masking adds a _keras_mask attribute to the tensor, which flags entries to be skipped, effectively letting other API methods know about it.
Check whether a layer supports masking via its supports_masking attribute. Example: tf.keras.layers.GlobalMaxPool1D().supports_masking
Masking logic is: skip a timestep if all features are equal to the masked value (TF source code uses not_equal and any to flag what remains)
    import numpy as np
    import tensorflow as tf
    arr = np.arange(6).reshape((1,6,1))
    arr_masked = tf.keras.layers.Masking(mask_value=5)(arr)
    print(arr_masked._keras_mask)
    print(arr_masked.numpy())
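The "not_equal and any" rule can also be written out without TensorFlow. The following is a plain-Python sketch of that rule on nested lists; compute_mask and apply_mask are illustrative helper names chosen here, not the library's functions.

```python
def compute_mask(batch, mask_value):
    """One boolean per timestep: True means keep, False means skip.

    batch is a nested list shaped (batch_size, timesteps, features).
    A timestep is kept if ANY of its features differs from mask_value.
    """
    return [
        [any(feature != mask_value for feature in timestep) for timestep in sequence]
        for sequence in batch
    ]


def apply_mask(batch, mask):
    """Zero out the features of every skipped timestep."""
    return [
        [[f if keep else 0 for f in timestep] for timestep, keep in zip(sequence, seq_mask)]
        for sequence, seq_mask in zip(batch, mask)
    ]


batch = [[[0], [1], [2], [3], [4], [5]]]  # same values as np.arange(6).reshape((1, 6, 1))
mask = compute_mask(batch, mask_value=5)
print(mask)                     # [[True, True, True, True, True, False]]
print(apply_mask(batch, mask))  # [[[0], [1], [2], [3], [4], [0]]]
```

Note that a timestep with several features is only skipped when all of them equal the mask value, so [5, 1] would be kept with mask_value=5.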
I think you have to pass the mask from layer to layer in a subclassing layer.
From the Tensorflow documentation: Quote
Note that in the call method of a subclassed model or layer, masks aren't automatically propagated, so you will need to manually pass a mask argument to any layer that needs one.

How to disable e.g. Dropout in tf.keras.Model to generate activation maximization images using transfer learning

I am using transfer learning and keras.applications.InceptionV3. I manage to train the model successfully.
However, when I want to generate "activation maximization" images (i.e. the input image that maximizes the activation of one of the custom classes, see e.g. https://arxiv.org/pdf/1512.02017v3.pdf ), I struggle to use the pre-trained model, because I do not manage to use it in "fit" mode with all dropouts etc. disabled.
What I do is combine the pre-trained model in a tf.keras.Sequential to do gradient descent on the weights of the first layer (the input image).
However, despite setting base_model.trainable = False, it seems as if the pre-trained model is put into training mode (although its weights are not updated) when calling model.fit(data) on the outer sequential model.
Is there any way to force the base_model (a child of a Sequential) to be in "predict" mode when calling fit on the outer?
I just came across the same question. After reading some documentation and looking at the source code of TensorFlow's implementations of tf.keras.layers.Layer, tf.keras.layers.Dense, and tf.keras.layers.BatchNormalization, I came to the following understanding.
If training=False is passed when calling the layer, it runs in inference mode. This has nothing to do with the attribute trainable, which means something different. It would probably cause less confusion if the argument had been called training_mode instead.
When doing transfer learning or fine-tuning, training=False should be passed when calling the base model itself. As far as I have seen, this only affects layers like tf.keras.layers.Dropout and tf.keras.layers.BatchNormalization and has no effect on the other layers.
Running in inference mode via training=False means that tf.keras.layers.Dropout does not apply dropout at all.
As tf.keras.layers.Dropout has no trainable weights, setting the attribute trainable = False has no effect on it at all.
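The difference between the trainable attribute and the training call argument can be illustrated with a toy layer. This is a plain-Python sketch, not the Keras Dropout class; ToyDropout and its seed parameter are made up for the example.

```python
import random


class ToyDropout:
    """Toy dropout layer: behaviour depends on `training`, never on `trainable`."""

    def __init__(self, rate, seed=0):
        self.rate = rate
        self.trainable = True  # would only matter for weight updates; there are no weights
        self._rng = random.Random(seed)

    def __call__(self, inputs, training=False):
        if not training:
            # Inference mode: pass the inputs through unchanged.
            return list(inputs)
        # Training mode: drop each unit with probability `rate`, scale up the rest.
        scale = 1.0 / (1.0 - self.rate)
        return [0.0 if self._rng.random() < self.rate else v * scale for v in inputs]


layer = ToyDropout(rate=0.5)
layer.trainable = False  # freezing the layer does not switch dropout off

x = [1.0, 2.0, 3.0, 4.0]
inference_out = layer(x, training=False)  # identical to x
training_out = layer(x, training=True)    # each entry is either 0.0 or 2 * x[i]
```

Only the training argument decides whether units are dropped here, which mirrors the point made above about passing training=False to the base model.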

What does `training=True` mean when calling a TensorFlow Keras model?

In TensorFlow's official documentation, they always pass training=True when calling a Keras model in a training loop, for example, logits = mnist_model(images, training=True).
I tried help(tf.keras.Model.call) and it shows that
    Help on function call in module tensorflow.python.keras.engine.network:
    call(self, inputs, training=None, mask=None)
        Calls the model on new inputs.
        In this case `call` just reapplies
        all ops in the graph to the new inputs
        (e.g. build a new computational graph from the provided inputs).
        Arguments:
            inputs: A tensor or list of tensors.
            training: Boolean or boolean scalar tensor, indicating whether to run
              the `Network` in training mode or inference mode.
            mask: A mask or list of masks. A mask can be
                either a tensor or None (no mask).
        Returns:
            A tensor if there is a single output, or
            a list of tensors if there are more than one outputs.
It says that training is a Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode. But I didn't find any information about these two modes.
In a nutshell, I don't know what effect this argument has. And what happens if I leave it out when training?
Some neural network layers behave differently during training and inference, for example Dropout and BatchNormalization layers. Taking dropout as an example:
During training, dropout will randomly drop out units and correspondingly scale up activations of the remaining units.
During inference, it does nothing (since you usually don't want the randomness of dropping out units here).
The training argument lets the layer know which of the two "paths" it should take. If you set this incorrectly, your network might not behave as expected.
The training argument indicates whether the layer should behave in training mode or in inference mode. For a BatchNormalization layer:
training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training.
Usually training=False is used for inference, but in some networks, such as pix2pix_cGAN, training=True is used during both inference and training.
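A small numeric sketch may help with the two normalization paths. It is a toy one-feature batch normalization in plain Python, without the learned scale and offset; ToyBatchNorm and its default momentum and epsilon values are assumptions made for the example, not the Keras layer.

```python
import math


class ToyBatchNorm:
    """Toy batch normalization over a flat list of numbers (one feature)."""

    def __init__(self, momentum=0.9, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = 0.0  # moving statistics start at mean 0, variance 1
        self.moving_var = 1.0

    def __call__(self, batch, training=False):
        if training:
            # Training path: statistics of the current batch, and update the moving ones.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Inference path: reuse the moving statistics, leave them untouched.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / math.sqrt(var + self.epsilon) for v in batch]


bn = ToyBatchNorm()
batch = [1.0, 2.0, 3.0]
train_out = bn(batch, training=True)   # centred on the batch mean (2.0)
infer_out = bn(batch, training=False)  # centred on the moving mean instead
```

After a single training step the moving mean has only moved a little towards the batch mean, so the two outputs differ noticeably. That is the kind of mismatch that appears when the argument is set incorrectly.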

There is no "name" variable in the constructor of BasicLSTMCell

In order to differentiate LSTMs, I wish to give a name to the BasicLSTMCell variable in my code. But it reported the following error:
num_units=self.config.num_lstm_units, state_is_tuple=True, name="some_basic_lstm")
TypeError: __init__() got an unexpected keyword argument 'name'
I then found the following in my TensorFlow installation, in the file rnn_cell_impl.py:
    class BasicLSTMCell(RNNCell):
        """Basic LSTM recurrent network cell.
        The implementation is based on: http://arxiv.org/abs/1409.2329.
        We add forget_bias (default: 1) to the biases of the forget gate in order to
        reduce the scale of forgetting in the beginning of the training.
        It does not allow cell clipping, a projection layer, and does not
        use peep-hole connections: it is the basic baseline.
        For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
        that follows.
        """
        def __init__(self, num_units, forget_bias=1.0,
                     state_is_tuple=True, activation=None, reuse=None):
            """Initialize the basic LSTM cell.
            Args:
              num_units: int, The number of units in the LSTM cell.
              forget_bias: float, The bias added to forget gates (see above).
                Must set to `0.0` manually when restoring from CudnnLSTM-trained
                checkpoints.
              state_is_tuple: If True, accepted and returned states are 2-tuples of
                the `c_state` and `m_state`. If False, they are concatenated
                along the column axis. The latter behavior will soon be deprecated.
              activation: Activation function of the inner states. Default: `tanh`.
              reuse: (optional) Python boolean describing whether to reuse variables
                in an existing scope. If not `True`, and the existing scope already has
                the given variables, an error is raised.
Is it a bug in my version of tensorflow? How can I give it a "name"?
I think @aswinids provided the best answer here in the comments, but let me explain why it should not be considered a bug. An LSTM cell comprises at least 4 variables (there are a few others used for control flow and such). There are 4 sub-network operations that occur in an LSTM. The diagram from Colah's blog illustrates the internals of an LSTM cell (http://colah.github.io/posts/2015-08-Understanding-LSTMs/):
Each of the yellow boxes has a set of weights assigned to it and is effectively a single layer neural network operation (piped together in an interesting way, defined by the LSTM architecture).
A good approach to naming these would then be tf.variable_scope('some_name') such that all 4 of the variables defined in the LSTM have a common base naming structure such as:
lstm_cell/f_t
lstm_cell/i_t
lstm_cell/C_t
lstm_cell/o_t
I suspect that previously they just did this and hard-coded lstm_cell, or whatever name they used, as the prefix for all the variables under the LSTM cell. In later versions, as @ashwinids points out, there is a name argument, and I suspect it simply replaces the lstm_cell prefix I used in the example here.

Tensorflow RNN weight matrices initialization

I'm using bidirectional_rnn with GRUCell but this is a general question regarding the RNN in Tensorflow.
I couldn't find how to initialize the weight matrices (input to hidden, hidden to hidden). Are they initialized randomly? To zeros? Are they initialized differently for each LSTM I create?
EDIT: Another motivation for this question is pre-training some LSTMs and using their weights in a subsequent model. I don't currently know how to do that without saving all the states and restoring the entire model.
Thanks.
How to initialize weight matrices for RNN?
I believe people use random normal initialization for the weight matrices of an RNN. Check out the example in the TensorFlow GitHub repo. The notebook is a bit long, but it contains a simple LSTM model where they use tf.truncated_normal to initialize weights and tf.zeros to initialize biases (I have also tried tf.ones for the biases before, and it seems to work too). I believe the standard deviation is a hyperparameter you can tune yourself. Weight initialization is sometimes important for gradient flow. As far as I know, though, the LSTM itself is designed to handle the vanishing gradient problem (and gradient clipping helps with the exploding gradient problem), so perhaps you don't need to be super careful with the choice of stddev in an LSTM. I've read papers recommending Xavier initialization (TF API doc for Xavier initializer) in the context of convolutional neural networks. I don't know if people use it in RNNs, but you can try it there too if you want to see whether it helps.
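For readers unfamiliar with the two initializers mentioned, here is a plain-Python sketch of what they compute. It assumes the usual definitions: a truncated normal re-draws samples that fall more than two standard deviations from the mean, and the Xavier (Glorot) uniform initializer samples from [-limit, limit] with limit = sqrt(6 / (fan_in + fan_out)). The function names and the fixed seed are choices made for this example.

```python
import math
import random


def truncated_normal(n, mean=0.0, stddev=1.0, rng=None):
    """Draw n normal samples, re-drawing any that lie more than 2 stddev from the mean."""
    rng = rng or random.Random(0)
    values = []
    while len(values) < n:
        v = rng.gauss(mean, stddev)
        if abs(v - mean) <= 2.0 * stddev:
            values.append(v)
    return values


def glorot_uniform_limit(fan_in, fan_out):
    """Half-width of the uniform range used by Xavier/Glorot uniform initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


weights = truncated_normal(1000, mean=0.0, stddev=0.1)
limit = glorot_uniform_limit(fan_in=64, fan_out=64)
print(max(abs(w) for w in weights) <= 0.2)  # True: nothing beyond 2 * stddev
```

The stddev in the first function is the hyperparameter referred to above, whereas the Xavier limit is derived from the layer sizes and needs no tuning.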
Now to follow up on @Allen's answer and the follow-up question you left in the comments.
How to control initialization with variable scope?
I will use the simple LSTM model from the TensorFlow GitHub Python notebook that I linked to as an example.
Specifically, if I wanted to refactor the LSTM part of that code using variable scope control, I might write something like the following...
    import tensorflow as tf
    def initialize_LSTMcell(vocabulary_size, num_nodes, initializer):
        '''initialize LSTMcell weights and biases, set variables to reuse mode'''
        gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
        with tf.variable_scope('LSTMcell') as scope:
            for gate in gates:
                with tf.variable_scope(gate) as gate_scope:
                    # Pass the initializer by keyword: the third positional
                    # argument of tf.get_variable is dtype, not initializer.
                    wx = tf.get_variable("wx", [vocabulary_size, num_nodes], initializer=initializer)
                    wt = tf.get_variable("wt", [num_nodes, num_nodes], initializer=initializer)
                    bi = tf.get_variable("bi", [1, num_nodes], initializer=tf.constant_initializer(0.0))
                    # This line can probably be omitted, because setting the 'LSTMcell'
                    # scope to reuse mode below also turns on reuse for all its child scopes.
                    gate_scope.reuse_variables()
            scope.reuse_variables()
    def get_scope_variables(scope_name, variable_names):
        '''a helper function to fetch variables based on scope_name and variable_names'''
        variables = {}
        with tf.variable_scope(scope_name, reuse=True):
            for var_name in variable_names:
                variables[var_name] = tf.get_variable(var_name)
        return variables
    def LSTMcell(i, o, state):
        '''a function for performing LSTMcell computation'''
        gates = ['input_gate', 'forget_gate', 'memory_cell', 'output_gate']
        var_names = ['wx', 'wt', 'bi']
        gate_comp = {}
        with tf.variable_scope('LSTMcell', reuse=True):
            for gate in gates:
                variables = get_scope_variables(gate, var_names)
                gate_comp[gate] = tf.matmul(i, variables['wx']) + tf.matmul(o, variables['wt']) + variables['bi']
        state = tf.sigmoid(gate_comp['forget_gate']) * state + tf.sigmoid(gate_comp['input_gate']) * tf.tanh(gate_comp['memory_cell'])
        output = tf.sigmoid(gate_comp['output_gate']) * tf.tanh(state)
        return output, state
The refactored code would be used something like this...
    initialize_LSTMcell(vocabulary_size, num_nodes, tf.truncated_normal_initializer(mean=-0.1, stddev=.01))
    # ...Doing some computation...
    LSTMcell(input_tensor, output_tensor, state)
Even though the refactored code may look less straightforward, using variable scope control ensures scope encapsulation and allows flexible control over the variables (in my opinion at least).
How to pre-train some LSTMs and use their weights in a subsequent model, without saving all the states and restoring the entire model?
Assuming you have a pre-trained model frozen and loaded, if you want to use its frozen 'wx', 'wt' and 'bi', you can simply find their parent scope names and variable names, then fetch the variables using a structure similar to the get_scope_variables function.
    with tf.variable_scope(scope_name, reuse=True):
        var = tf.get_variable(var_name)
Here is a link to understanding variable scope and sharing variables. I hope this is helpful.
The RNN models will create their variables with get_variable, and you can control the initialization by wrapping the code which creates those variables with a variable_scope and passing a default initializer to it. Unless the RNN specifies one explicitly (looking at the code, it doesn't), uniform_unit_scaling_initializer is used.
You should also be able to share model weights by declaring the second model and passing reuse=True to its variable_scope. As long as the namespaces match up, the new model will get the same variables as the first model.
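The interplay of a scope-level default initializer and reuse=True can be modelled with a dictionary. The following is a toy model in plain Python, not TensorFlow's variable store; ToyScope, ones and zeros are hypothetical names, and falling back to zeros when no initializer is given is an arbitrary choice for the sketch.

```python
def ones(size):
    return [1.0] * size


def zeros(size):
    return [0.0] * size


class ToyScope:
    """Toy variable scope backed by a shared dict that maps full names to values."""

    def __init__(self, store, name, initializer=None, reuse=False):
        self.store = store
        self.name = name
        self.initializer = initializer  # default for variables created in this scope
        self.reuse = reuse

    def get_variable(self, name, size, initializer=None):
        full_name = self.name + "/" + name
        if self.reuse:
            if full_name not in self.store:
                raise ValueError(full_name + " does not exist")
            return self.store[full_name]  # the very same object, hence shared weights
        if full_name in self.store:
            raise ValueError(full_name + " already exists; did you mean reuse=True?")
        # Explicit initializer first, then the scope default, then the toy fallback.
        init = initializer or self.initializer or zeros
        self.store[full_name] = init(size)
        return self.store[full_name]


store = {}
first_model = ToyScope(store, "rnn", initializer=ones)
w1 = first_model.get_variable("kernel", 3)   # created with the scope's default initializer

second_model = ToyScope(store, "rnn", reuse=True)
w2 = second_model.get_variable("kernel", 3)  # same namespace, so the same variable

print(w1)        # [1.0, 1.0, 1.0]
print(w2 is w1)  # True
```

Sharing only works here because both scopes are named "rnn", which mirrors the remark above that the namespaces have to match up.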
A simple way to initialize all kernel weights with a certain initializer is to pass the initializer to tf.variable_scope(). For example:
    with tf.variable_scope('rnn', initializer=tf.variance_scaling_initializer()):
        basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
        outputs, state = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)