Are masks in Tensorflow automatically consumed by loss and metric? - tensorflow

This answer says:
If there's a mask in your model, it'll be propagated layer-by-layer
and eventually applied to the loss. So if you're padding and masking
the sequences in a correct way, the loss on the padding placeholders
would be ignored.
However in TensorFlow's tutorial on Transformers, the author has implemented custom loss and metric where masks are computed and applied internally. Is this necessary?
Note in the code of the Transformer model, the author has deleted the keras mask:
try:
    # Drop the keras mask, so it doesn't scale the losses/metrics.
    # b/250038731
    del logits._keras_mask
except AttributeError:
    pass

# Return the final output and the attention weights.
return logits
Do we need to implement a custom loss and metric that apply the mask, or can we use the built-in ones?
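For reference, the arithmetic that a manually masked loss performs can be sketched without any framework. The snippet below is a NumPy illustration only, not the tutorial's code: the function name, the `pad_id` parameter and the toy data are all assumptions made for the example. It averages the per-token cross-entropy over non-padding positions and compares that with a plain mean over every position.

```python
import numpy as np

def masked_sparse_crossentropy(labels, logits, pad_id=0):
    """Mean cross-entropy over non-padding positions only (hypothetical helper)."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Per-token loss: negative log-probability of the true class.
    per_token = -np.take_along_axis(log_probs, labels[..., None], axis=-1)[..., 0]
    # 1.0 for real tokens, 0.0 for padding.
    mask = (labels != pad_id).astype(per_token.dtype)
    return (per_token * mask).sum() / mask.sum()

labels = np.array([[3, 1, 0, 0]])   # two real tokens followed by two padding tokens
logits = np.zeros((1, 4, 5))        # uniform predictions over 5 classes
logits[0, 2:, 4] = 10.0             # confident "wrong" predictions on the padding positions

masked = masked_sparse_crossentropy(labels, logits)
# pad_id=-1 never occurs in labels, so no position is masked out.
unmasked = masked_sparse_crossentropy(labels, logits, pad_id=-1)
print(masked, unmasked)
```

The masked value depends only on the two real tokens, while the unmasked mean is dominated by the padding positions, which is the effect a mask-aware loss is meant to avoid.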


Keras: Custom loss function with training data not directly related to model

I am trying to convert my CNN written with tensorflow layers to use the keras api in tensorflow (I am using the keras api provided by TF 1.x), and am having trouble writing a custom loss function to train the model.
According to this guide, when defining a loss function it expects the arguments (y_true, y_pred)
def basic_loss_function(y_true, y_pred):
    return ...
However, in every example I have seen, y_true is somehow directly related to the model (in the simple case it is the output of the network). In my problem, this is not the case. How do I implement this if my loss function depends on some training data that is unrelated to the tensors of the model?
To be concrete, here is my problem:
I am trying to learn an image embedding trained on pairs of images. My training data includes image pairs and annotations of matching points between the image pairs (image coordinates). The input feature is only the image pairs, and the network is trained in a siamese configuration.
I am able to implement this with tensorflow layers and train it successfully with tensorflow estimators.
My current implementation builds a tf Dataset from a large database of tf Records, where the features are a dictionary containing the images and arrays of matching points. Before, I could easily feed these arrays of image coordinates to the loss function, but here it is unclear how to do so.
A hack I often use is to calculate the loss within the model, by means of Lambda layers. (This helps when the loss is independent of the true data, for instance, and the model doesn't really have an output to be compared.)
In a functional API model:
def loss_calc(x):
    # arbitrary inputs, chosen according to what you gave to the Lambda layer
    loss_input_1, loss_input_2 = x

    # here you use some external data that doesn't relate to the samples
    externalData = K.constant(external_numpy_data)

    loss = ...  # calculate the loss from the values above
    return loss
Using the outputs of the model itself (the tensor(s) that are used in your loss)
loss = Lambda(loss_calc)([model_output_1, model_output_2])
Create the model outputting the loss instead of the outputs:
model = Model(inputs, loss)
Create a dummy keras loss function for compilation:
def dummy_loss(y_true, y_pred):
    return y_pred  # y_pred is the loss itself, the output of the model above

model.compile(loss=dummy_loss, ...)
Use any dummy array with the right number of samples as the training target; it will be ignored:
model.fit(..., np.zeros((number_of_samples,)), ...)
Another way of doing it, is using a custom training loop.
This is much more work, though.
Although you're using TF1, you can still turn eager execution on at the very beginning of your code and do stuff like it's done in TF2. (tf.enable_eager_execution())
Follow the tutorial for custom training loops:
Here, you calculate the gradients yourself, of any result regarding whatever you want. This means you don't need to follow Keras standards of training.
Finally, you can use the approach you suggested of model.add_loss.
In this case, you calculate the loss exactly the same way I did in the first approach, and pass this loss tensor to add_loss.
You can probably compile a model with loss=None then (not sure), because you're going to use other losses, not the standard one.
In this case, your model's output will probably be None too, and you should fit with y=None.

Clarification on Tensorflow 2.0 Masking

From the Tensorflow documentation when using Keras subclassing API, they give this example on how to pass a mask along to other layers that implement masking. I am wondering if this is explicitly required or if it is handled correctly after the Embedding layer has mask_zero=True.
import numpy as np
from tensorflow.keras import layers

class MyLayer(layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)
        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
        self.lstm = layers.LSTM(32)

    def call(self, inputs):
        x = self.embedding(inputs)
        # Note that you could also prepare a `mask` tensor manually.
        # It only needs to be a boolean tensor
        # with the right shape, i.e. (batch_size, timesteps).
        mask = self.embedding.compute_mask(inputs)
        output = self.lstm(x, mask=mask)  # The layer will ignore the masked values
        return output

layer = MyLayer()
x = np.random.random((32, 10)) * 100
x = x.astype('int32')
My confusion comes from another area of the documentation which states:
This layer supports masking for input data with a variable number of
timesteps. To introduce masks to your data, use an Embedding layer
with the mask_zero parameter set to True.
Which seems to mean that if mask_zero=True no further commands need to be done on subsequent layers.
The documentation of the Masking layer also confirms that once you introduce a mask at the beginning, all the remaining layers get the mask automatically.
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
If any downstream layer does not support masking yet receives such an input mask, an exception will be raised.
This other link also states the same. The mask will be propagated to all layers.
When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
The second link is really full of details on masking.
Notice that the code you showed is for a custom embedding. It teaches you how to "create and pass" a mask, in case you want to write a layer that creates a mask. It's basically showing what the normal Embedding layer does.
So, we can conclude that if you're using a normal Embedding layer, all you need is mask_zero=True and everything will go down the stream.
In addition to the high-level answer given, let's have a look at some important technical details.
If in doubt, inspect the masking source code to understand how it works.
Masking adds a _keras_mask attribute to the tensor, which flags entries to be skipped, effectively letting other API methods know about it.
Check whether a layer supports masking via its supports_masking attribute. Example: tf.keras.layers.GlobalMaxPool1D().supports_masking
The masking logic is: skip a timestep if all features are equal to the mask value (the TF source code uses not_equal and any to flag what remains).
import numpy as np
import tensorflow as tf

arr = np.arange(6).reshape((1, 6, 1))
arr_masked = tf.keras.layers.Masking(mask_value=5)(arr)
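The not_equal/any rule described above can be reproduced in plain NumPy to see which timesteps survive. This is a sketch of the logic only; `compute_mask` here is a hypothetical helper written for the illustration, not the Keras method of the same name.

```python
import numpy as np

def compute_mask(inputs, mask_value):
    # Keep a timestep if ANY of its features differs from mask_value,
    # i.e. skip it only when ALL features equal mask_value.
    return np.any(inputs != mask_value, axis=-1)

# Same array as in the snippet above: one sequence, six timesteps, one feature.
arr = np.arange(6).reshape((1, 6, 1))
mask = compute_mask(arr, mask_value=5)
# Skipped timesteps are also zeroed in the output.
masked = arr * mask[..., None]

# With two features, a timestep is masked only if both equal mask_value.
two_features = np.array([[[5, 5], [5, 1]]])
mask_two = compute_mask(two_features, mask_value=5)
print(mask, mask_two)
```

Only the last timestep of `arr` (value 5) is flagged as skipped, and in the two-feature case the timestep `[5, 1]` is kept because one of its features differs from the mask value.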
I think you have to pass the mask from layer to layer in a subclassing layer.
From the TensorFlow documentation:
Note that in the call method of a subclassed model or layer, masks aren't automatically propagated, so you will need to manually pass a mask argument to any layer that needs one.

What does `training=True` mean when calling a TensorFlow Keras model?

In TensorFlow's official documentation, they always pass training=True when calling a Keras model in a training loop, for example, logits = mnist_model(images, training=True).
I tried help( and it shows that
Help on function call in module
call(self, inputs, training=None, mask=None)
Calls the model on new inputs.
In this case `call` just reapplies
all ops in the graph to the new inputs
(e.g. build a new computational graph from the provided inputs).
inputs: A tensor or list of tensors.
training: Boolean or boolean scalar tensor, indicating whether to run
the `Network` in training mode or inference mode.
mask: A mask or list of masks. A mask can be
either a tensor or None (no mask).
A tensor if there is a single output, or
a list of tensors if there are more than one outputs.
It says that training is a Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode. But I didn't find any information about these two modes.
In a nutshell, I don't know what effect this argument has. And what happens if I omit it when training?
Some neural network layers behave differently during training and inference, for example Dropout and BatchNormalization layers. Taking Dropout as an example:
During training, dropout will randomly drop out units and correspondingly scale up activations of the remaining units.
During inference, it does nothing (since you usually don't want the randomness of dropping out units here).
The training argument lets the layer know which of the two "paths" it should take. If you set this incorrectly, your network might not behave as expected.
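The two "paths" of a dropout layer can be illustrated with a few lines of NumPy. This is a minimal sketch of inverted dropout under assumed names (`dropout`, `rate`, `rng`), not the Keras implementation.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Toy inverted dropout; `training` selects which path is taken."""
    if not training:
        # Inference path: the input passes through unchanged.
        return x
    # Training path: drop each unit with probability `rate` and scale
    # the survivors by 1 / (1 - rate) to keep the expected value.
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((2, 4))
train_out = dropout(x, rate=0.5, training=True, rng=rng)
infer_out = dropout(x, rate=0.5, training=False, rng=rng)
print(train_out)
print(infer_out)
```

With `rate=0.5`, every entry of `train_out` is either 0 (dropped) or 2 (kept and scaled up), while `infer_out` is identical to the input.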
The training argument indicates whether the layer should behave in training mode or in inference mode. For a BatchNormalization layer, for example:
training=True: The layer will normalize its inputs using the mean and variance of the current batch of inputs.
training=False: The layer will normalize its inputs using the mean and variance of its moving statistics, learned during training.
Usually training=False is used in inference mode, but in some networks, such as the pix2pix cGAN, training=True is used at both inference and training time.
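The difference between batch statistics and moving statistics can be sketched as follows. `TinyBatchNorm` is a hypothetical, simplified class written for this illustration: it omits the learnable scale and offset, and its `momentum` and `eps` values are arbitrary choices, not a statement about the Keras defaults.

```python
import numpy as np

class TinyBatchNorm:
    """Simplified batch normalization without learnable scale/offset."""

    def __init__(self, num_features, momentum=0.9, eps=1e-3):
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch
            # and update the moving averages.
            mean, var = x.mean(axis=0), x.var(axis=0)
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # Normalize with the moving statistics accumulated so far.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

bn = TinyBatchNorm(num_features=2)
x = np.array([[1.0, 10.0], [3.0, 30.0]])
train_out = bn(x, training=True)    # uses the batch mean and variance
infer_out = bn(x, training=False)   # uses the moving mean and variance
print(train_out)
print(infer_out)
```

The same input yields different outputs in the two modes, which is why passing the wrong value for training can silently change a model's behavior.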

TensorFlow softmax_crossentropy_with logits: are "labels" also trained (if differentiable)?

The softmax cross-entropy with logits loss function is used to reduce the difference between the logits and labels provided to the function. Typically, the labels are fixed for supervised learning and the logits are adapted. But what happens when the labels come from a differentiable source, e.g., another network? Do both networks, i.e., the "logits network" and the "labels network" get trained by the subsequent optimizer, or does this loss function always treat the labels as fixed?
TLDR: Does tf.nn.softmax_cross_entropy_with_logits() also provide gradients for the labels (if they are differentiable), or are they always considered fixed?
You need to use tf.nn.softmax_cross_entropy_with_logits_v2 to get gradients with respect to labels.
The gradient is calculated from the loss provided to the optimizer. If the "labels" come from another trainable network, then yes, that network will be modified, since its outputs influence the loss. The correct way of using another network's outputs for your own is to define that network as untrainable, or to make a list of all variables you want to train and pass them to the optimizer explicitly.
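That the loss has a well-defined gradient with respect to the labels can be checked numerically. The NumPy sketch below is not TensorFlow code; the helper names and the toy values are assumptions. It compares the analytic label gradient of softmax cross-entropy, which is the negative log-softmax of the logits, with a finite-difference estimate.

```python
import numpy as np

def log_softmax(z):
    shifted = z - z.max()
    return shifted - np.log(np.exp(shifted).sum())

def cross_entropy(labels, logits):
    # loss = -sum_i labels_i * log_softmax(logits)_i
    return -np.sum(labels * log_softmax(logits))

logits = np.array([2.0, 0.5, -1.0])
labels = np.array([0.7, 0.2, 0.1])

# Analytic gradients of the loss.
grad_labels = -log_softmax(logits)
grad_logits = np.exp(log_softmax(logits)) * labels.sum() - labels

# Central finite differences with respect to each label entry.
eps = 1e-6
numeric = np.array([
    (cross_entropy(labels + eps * np.eye(3)[i], logits)
     - cross_entropy(labels - eps * np.eye(3)[i], logits)) / (2 * eps)
    for i in range(3)
])
print(grad_labels, numeric)
```

Because the label gradient is nonzero, whether a "labels network" gets updated depends on whether the framework lets that gradient flow back, which is the distinction between the two TensorFlow functions mentioned above.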

Getting and editing gradient parameters in Caffe

Is it possible to get the gradients with respect to each layer in Caffe in CNNs, edit them and again apply the new gradients in the training process? If possible, using pycaffe interface.
For example in TensorFlow, it could be done by means of functions:
I'm not sure what you mean by "apply the new gradients in the training process", but you can access the gradients in the pycaffe interface:
import caffe
net = caffe.Net('/path/to/net.prototxt', '/path/to/weights.caffemodel', caffe.TEST)
# provide inputs to the net, do a pass so that meaningful data/gradients propagate to all the layers
# once data/gradients are updated, you can access them
net.blobs['blob_name'].diff # access the gradient of blob 'blob_name'
net.layers[5].blobs[0].diff # access the gradient of the first parameter blob of the 6th layer
To map between layer names and layer indices, you can use this code:
This will return the index of layer 'layer_name'.