TensorFlow 2 GRU Layer with multiple hidden layers

I am attempting to port some TensorFlow 1 code to TensorFlow 2. The old code used the now deprecated MultiRNNCell to create a GRU layer with multiple hidden layers. In TensorFlow 2 I want to use the in-built GRU Layer, but there doesn't seem to be an option which allows for multiple hidden layers with that class. The PyTorch equivalent has such an option exposed as an initialization parameter, num_layers.
My workaround has been to use the TensorFlow RNN layer and pass a GRU cell for each hidden layer I want - this is the way recommended in the docs:
import tensorflow as tf

dim = 1024
num_layers = 4
cells = [tf.keras.layers.GRUCell(dim) for _ in range(num_layers)]
gru_layer = tf.keras.layers.RNN(
    cells,
    return_sequences=True,
    stateful=True
)
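(Since the stack is stateful, it has to be built with a fixed batch size; a minimal usage sketch, where the batch size is an assumption:

batch_size = 10  # stateful RNNs require a fixed batch size
inputs = tf.keras.Input(batch_shape=(batch_size, None, dim))
outputs = gru_layer(inputs)
model = tf.keras.Model(inputs, outputs)

)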
But the in-built GRU layer has support for CuDNN, which the plain RNN seems to lack, to quote the docs:
Mathematically, RNN(LSTMCell(10)) produces the same result as LSTM(10). In fact, the implementation of this layer in TF v1.x was just creating the corresponding RNN cell and wrapping it in a RNN layer. However using the built-in GRU and LSTM layers enables the use of CuDNN and you may see better performance.
So how can I achieve this? How do I get a GRU layer that supports both multiple hidden layers and CuDNN? Given that the in-built GRU layer in TensorFlow lacks such an option, is it in fact necessary? Or is the only way to get a deep GRU network to stack multiple GRU layers in a sequence?
EDIT: It seems, according to this answer to a similar question, that there is indeed no in-built way to create a GRU Layer with multiple hidden layers, and that they have to be stacked manually.

OK, so it seems the only way to achieve this is to define a stack of GRU Layer instances. This is what I came up with (note that I only need stateful GRU layers that return sequences, and don't need the last layer's return state):
import tensorflow as tf

class RNN(tf.keras.layers.Layer):
    def __init__(self, dim, num_layers=1):
        super(RNN, self).__init__()
        self.dim = dim
        self.num_layers = num_layers

        def layer():
            return tf.keras.layers.GRU(
                self.dim,
                return_sequences=True,
                return_state=True,
                stateful=True)

        # Attach each GRU as a named attribute so the parent layer tracks it.
        self._layer_names = ['layer_' + str(i) for i in range(self.num_layers)]
        for name in self._layer_names:
            self.__setattr__(name, layer())

    def call(self, inputs):
        seqs = inputs
        state = None
        for name in self._layer_names:
            rnn = self.__getattribute__(name)
            # Feed each layer's output sequence (and final state) to the next.
            (seqs, state) = rnn(seqs, initial_state=state)
        return seqs
It's necessary to manually add the internal RNN layers to the parent layer using __setattr__. It seems that putting the RNNs in a list and setting that list as a layer attribute won't let the parent layer track the internal layers (see this answer to this issue).
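For reference, a minimal usage sketch; the input shape and batch size here are assumptions, and since the layers are stateful the batch size must be fixed:

import numpy as np
import tensorflow as tf

batch_size = 10  # stateful layers need a fixed batch size
rnn = RNN(dim=1024, num_layers=4)

inputs = tf.keras.Input(batch_shape=(batch_size, None, 1024))
outputs = rnn(inputs)
model = tf.keras.Model(inputs, outputs)

seqs = np.random.random((batch_size, 20, 1024)).astype('float32')
out = model(seqs)  # shape (10, 20, 1024): one output vector per timestep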
I hoped that this would speed up my network. Tests on Colab have shown no difference so far; if anything it's actually slightly slower than a plain RNN initialized with a list of GRU cells. I thought that increasing the batch size from 10 to 64 might make a difference, but no, they still seem to perform at around the same speed.
UPDATE: In fact there does seem to be a noticeable speed-up, but only if I don't decorate my training step function with tf.function (I have a custom training loop, I don't use Model.fit). Not a huge increase in speed: maybe about 33% faster with a batch size of 96. A much smaller batch size (between 10 and 20) gives an even bigger speed-up, about 70%.

Related

Gaussian projection versus Gaussian noise

I am facing difficulties with the following layer in Keras:

import tensorflow as tf

gaussian_projection = 64
gaussian_scale = 20
initializer = tf.keras.initializers.TruncatedNormal(mean=0.0, stddev=gaussian_scale)
proj_kernel = tf.keras.layers.Dense(gaussian_projection, use_bias=False, trainable=False,
                                    kernel_initializer=initializer)
What does the above layer intend to do? Is it a layer that adds Gaussian noise, or something different?
I hope someone knows about it.
##################### A second version of the layer #####################
input_dim = 3
new_layer = tf.keras.layers.Dense(input_dim, use_bias=False, trainable=False,
                                  kernel_initializer='identity')
tf.keras.layers.GaussianNoise(stddev=gaussian_scale)
Do both versions of the layer (1st and 2nd) intend to do the same thing, i.e., add Gaussian noise?
I think the above two are different, as follows:
The first block of code basically creates a Dense layer, in which the gaussian_projection variable is the number of units and the initializer is a way to initialize the layer's weights. Such initialization is normally done to improve the convergence of the layer and the network; overall, the first block of code is a typical Dense layer. I think no noise is added in this first block. Note, though, that because trainable=False and the kernel is drawn from a truncated normal distribution, the layer acts as a fixed random (Gaussian) projection of its input rather than a learnable transform.
On the other hand, the second block of code creates a GaussianNoise layer after the Dense layer, which is normally done to regularize the network and reduce overfitting. And based on the official documentation, this GaussianNoise layer is only active during training.
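To make the difference concrete, here is a small sketch contrasting the two; the input shape is an assumption, and the values reuse the snippets above:

import tensorflow as tf

x = tf.random.normal((4, 16))  # assumed input: batch of 4, 16 features

# Version 1: a frozen, randomly initialized Dense layer. It mixes the
# input features through a fixed random matrix (a random projection);
# the output is deterministic, so no noise is added.
proj = tf.keras.layers.Dense(
    64, use_bias=False, trainable=False,
    kernel_initializer=tf.keras.initializers.TruncatedNormal(mean=0.0, stddev=20))
print(tf.reduce_all(proj(x) == proj(x)).numpy())  # True: same output every call

# Version 2: additive Gaussian noise. GaussianNoise perturbs the input
# with random noise, but only when training=True.
noise = tf.keras.layers.GaussianNoise(stddev=20)
print(tf.reduce_all(noise(x, training=True) == x).numpy())   # False: noise added
print(tf.reduce_all(noise(x, training=False) == x).numpy())  # True: inactive at inference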

Custom loss function in Keras that penalizes output from intermediate layer

Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using keras backend to combine categorical crossentropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!
Actually, what you are interested in is regularization, and in Keras there are two different kinds of built-in regularization approaches available for most of the layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are summed into the model's loss function, and the result would be the final loss value which would be optimized by the optimizer during training.
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
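For example, a custom regularizer that penalizes the sum of squared activations, as described in the question, might look like this sketch; the rate value is an assumption:

import tensorflow as tf

def squared_sum_regularizer(rate=0.01):
    # Returns a regularizer that penalizes the sum of squared activations.
    def regularizer(activations):
        return rate * tf.reduce_sum(tf.square(activations))
    return regularizer

layer = tf.keras.layers.Dense(64, activity_regularizer=squared_sum_regularizer(0.01))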
Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
For the custom loss function, I think it's something like this:
def custom_loss(y_true, y_pred):
    x3, x2 = y_pred  # the outputs were defined as [x3, x2] above
    label = y_true   # you might need to provide a dummy target for x2
    return f1(x2) + f2(label, x3)  # whatever you want to do with f1, f2
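Note that when a multi-output Keras model is compiled, each output normally gets its own loss, so an alternative to unpacking y_pred is to give the intermediate output a separate penalty loss. A sketch under assumptions: the 0.01 weight, x_train and y_train are placeholders, and the Dense sizes follow the example above:

import numpy as np
import tensorflow as tf

def activity_penalty(y_true, y_pred):
    # Ignores y_true and penalizes the squared activations of x2.
    return tf.reduce_sum(tf.square(y_pred), axis=-1)

model.compile(optimizer='adam',
              loss=['categorical_crossentropy', activity_penalty],
              loss_weights=[1.0, 0.01])

# Dummy targets for the second output (ignored by the penalty);
# 10 matches the width of the x2 layer above.
dummy = np.zeros((len(x_train), 10))
model.fit(x_train, [y_train, dummy])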
Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code taken directly from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs
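A usage sketch for the MNIST-style case from the question (the layer sizes are assumptions): drop the regularizer in after the layer whose output should be penalized, and the loss it adds is summed with the compiled loss automatically.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    MyActivityRegularizer(rate=1e-2),  # penalizes the previous layer's output
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')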

Clarification on Tensorflow 2.0 Masking

From the TensorFlow documentation, when using the Keras subclassing API, they give this example of how to pass a mask along to other layers that implement masking. I am wondering if this is explicitly required, or if it is handled automatically once the Embedding layer has mask_zero=True.
import numpy as np
from tensorflow.keras import layers

class MyLayer(layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)
        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
        self.lstm = layers.LSTM(32)

    def call(self, inputs):
        x = self.embedding(inputs)
        # Note that you could also prepare a `mask` tensor manually.
        # It only needs to be a boolean tensor
        # with the right shape, i.e. (batch_size, timesteps).
        mask = self.embedding.compute_mask(inputs)
        output = self.lstm(x, mask=mask)  # The layer will ignore the masked values
        return output

layer = MyLayer()
x = np.random.random((32, 10)) * 100
x = x.astype('int32')
layer(x)
My confusion comes from another area of the documentation which states:
Masking
This layer supports masking for input data with a variable number of
timesteps. To introduce masks to your data, use an Embedding layer
with the mask_zero parameter set to True.
This seems to mean that if mask_zero=True, no further steps are needed in subsequent layers.
If you read about the Masking layer, you'll see it also supports this: once you use the mask at the beginning, all the rest of the layers get the mask automatically.
Quote:
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
If any downstream layer does not support masking yet receives such an input mask, an exception will be raised.
This other link also states the same. The mask will be propagated to all layers.
Quote:
When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
The second link is really full of details on masking.
Notice that the code you showed is for a custom embedding. It teaches you how to "create and pass" a mask, in case you want to create a layer that will create a mask. It's basically showing what the normal Embedding layer does.
So, we can conclude that if you're using a normal Embedding layer, all you need is mask_zero=True and everything will go down the stream.
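A minimal sketch of that conclusion, with the sizes taken from the snippet above:

import tensorflow as tf

# The mask created by Embedding(mask_zero=True) flows to the LSTM
# automatically; no compute_mask call or mask argument is needed.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True),
    tf.keras.layers.LSTM(32),
])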
In addition to the high-level answer given, let's have a look at some important technical details.
In case of doubt, inspect the masking source code to understand how it works.
Masking adds a _keras_mask attribute to the tensor, which flags entries to be skipped, effectively letting other API methods know about it.
Test yourself if a layer supports the mask, via supports_masking attribute. Example: tf.keras.layers.GlobalMaxPool1D().supports_masking
Masking logic is: skip a timestep if all features are equal to the masked value (TF source code uses not_equal and any to flag what remains)
import numpy as np
import tensorflow as tf

arr = np.arange(6).reshape((1, 6, 1))
arr_masked = tf.keras.layers.Masking(mask_value=5)(arr)
print(arr_masked._keras_mask)  # [[ True  True  True  True  True False]]
print(arr_masked.numpy())
I think you have to pass the mask from layer to layer in a subclassed layer.
From the Tensorflow documentation: Quote
Note that in the call method of a subclassed model or layer, masks aren't automatically propagated, so you will need to manually pass a mask argument to any layer that needs one.

Writing own convolutional layer in Keras from scratch

I would like to create my own layer in Keras. To be more precise, I would like to create a simple convolutional layer using only the NumPy library (without the TensorFlow part). I have some reasons for doing that: first, to learn something new, and second, I have an idea of how to modify that layer, so I have to write it from scratch. To make the problem easier, we can assume that I only need a convolutional layer with a 3x3 kernel size and defaults for the other parameters.
I know I have to base on: https://keras.io/layers/writing-your-own-keras-layers/
In the def build(self, input_shape): section I have to add weights. A convolutional layer needs filters kernels, each a 3x3 matrix.
In the def call(self, x): section I can use those weights. But I have some problems with that.
Problems:
I need to get something like sliding over the input, the typical convolutional layer task (moving a 3x3 matrix across the image). But I can't do that, because x in def call(self, x): has ? or None as the first value of its shape. I know it is the batch size, but because of that I can't loop over the tensor. So how can I get all the data (numbers) from x in order to perform operations on them?
Maybe you have some general tips on how I can make my own convolutional layer from scratch in Keras?
The problem for me is not writing a convolutional layer in NumPy (there are materials about that, for example: https://github.com/Eyyub/numpy-convnet) but merging it with Keras without using the TensorFlow backend.
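For illustration only, a minimal sketch of the build/call skeleton such a layer needs. The filter count is an assumption, and tf.nn.conv2d stands in for the sliding-window arithmetic; a pure-NumPy replacement would only work eagerly (e.g. wrapped in tf.py_function), because in graph mode x is symbolic and its values can't be read.

import tensorflow as tf

class MyConv2D(tf.keras.layers.Layer):
    def __init__(self, filters=32, **kwargs):
        super(MyConv2D, self).__init__(**kwargs)
        self.filters = filters

    def build(self, input_shape):
        # One 3x3 kernel per (input channel, filter) pair.
        in_channels = input_shape[-1]
        self.kernel = self.add_weight(
            name='kernel',
            shape=(3, 3, in_channels, self.filters),
            initializer='glorot_uniform',
            trainable=True)

    def call(self, x):
        # The batch dimension may be None here; ops like conv2d handle
        # that symbolically. A NumPy sliding-window version would need
        # eager tensors to read the actual values.
        return tf.nn.conv2d(x, self.kernel, strides=1, padding='SAME')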

How can you apply sequence-wise batch normalization after each layer in a multi-layer RNN?

I am reproducing the Sequential MNIST experiment from this paper, where they use a Recurrent Neural Network with 6 layers and apply batch normalization after each layer.
They seem to use sequence-wise normalization meaning that the outputs are normalized not only across batches but across time steps as well. This is a problem, because it means that I cannot modify the BasicRNNCell to do the batch normalization in the cell's call method. For that to work the method would have to know what it outputs in the future time steps.
So, my current solution is for each layer to:
Unroll the RNN layer
Add a batch normalization layer after this
In code it looks like this:
layer_input = network_input
for layer in range(6):
    # A separate variable scope per layer avoids variable-name clashes
    # between the stacked dynamic_rnn calls.
    with tf.variable_scope('layer_{}'.format(layer)):
        cell = BasicRNNCell(128)
        layer_output, _ = tf.nn.dynamic_rnn(cell, layer_input, dtype=tf.float32)
        layer_output = tf.layers.batch_normalization(layer_output)
        layer_input = layer_output
network_output = layer_output
My question: Unrolling the RNN for every layer seems like the brute force way to achieve sequence-wise batch normalization after each layer. Is there a more efficient way, for example one that uses MultiRNNCell?
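For what it's worth, the same unroll-each-layer pattern can also be written with Keras layers. This is only a sketch of the identical approach under an assumed input shape (e.g. sequential MNIST with one feature per timestep), not a more efficient alternative:

import tensorflow as tf

inputs = tf.keras.Input(shape=(None, 1))  # assumed: one feature per timestep
x = inputs
for _ in range(6):
    # Each depth level: a full RNN pass over the sequence, then batch
    # normalization of its per-timestep outputs (the default axis
    # normalizes across batch and time together, i.e. sequence-wise).
    x = tf.keras.layers.SimpleRNN(128, return_sequences=True)(x)
    x = tf.keras.layers.BatchNormalization()(x)
model = tf.keras.Model(inputs, x)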