tensorflow - how to use variational recurrent dropout correctly - tensorflow

The tensorflow config dropout wrapper has three different dropout probabilities that can be set: input_keep_prob, output_keep_prob, state_keep_prob.
I want to use variational dropout for my LSTM units, by setting the variational_recurrent argument to true. However, I don't know which of the three dropout probabilities I have to use for variational dropout to function correctly.
Can someone provide help?

According to this paper https://arxiv.org/abs/1512.05287 that is used for implementation of the variational_recurrent dropouts, you can think about as follows,
input_keep_prob - probability that dropping out input connections.
output_keep_prob - probability that dropping out output connections.
state_keep_prob - Probability that droping out recurrent connections.
See the diagram below,
If you set the variational_recurrent to be true you will get an RNN that's similar to the model in right and otherwise in left.
The basic differences in above two models are,
Variational RNN repeats the same dropout mask at each time
step for both inputs, outputs, and recurrent layers (drop
the same network units at each time step).
Native RNN uses different dropout masks at each time step for the
inputs and outputs alone (no dropout is used with the recurrent
connections since the use of different masks with these connections
leads to deteriorated performance).
In the above diagram, coloured connections represent the dropped-out connections, with different colours corresponding to different dropout masks. Dashed lines correspond to standard connections with no dropout.
Therefore, if you use a variational RNN you can set all three probability parameters according to your requirement.
Hope this helps.

Related

Should feature embeddings be taken before or after dropout layer in neural network?

I am training a binary text classification model using BERT as follows:
def create_model():
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs=[l2])
return model
This code is borrowed from the example on tfhub: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4.
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or is it fine either way? For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, this might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1 denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).

What are the uses of layers in keras/Tensorflow

So I am new to computer vision, and I do not really know what the layers do in keras. What is the use of adding layers (dense, Conv2D, etc) in keras? What do they add to it?
Convolution neural network has 4 main steps: Convolution, Pooling, Flatten, and Full connection.
Conv2D(), Conv3D(), etc. is for Feature extraction (It's a Convolution Layer).
Pooling layers (MaxPool2D(), AvgPool2D(), etc) is for Feature extraction as well (It has different operation though).
Flattening layers (Flatten() ) are to convert the extracted feature map into Vector before being fed into the Fully connection layers (The Dense layers).
Dense layers are for Fully connected step in Computer vision that acts as Classifier (The Neural network classify each extracted features from the Convolution layers.)
There are also optimization layers such as Dropout(), BatchNormalization(), etc.
For more information, just open the keras documentation.
If you want to start learning Convolution neural network, this article may help.
A layer in an Artificial Neural Network is a bunch of nodes bound together at a specific depth in a Neural Network. Keras is a high level API used over NN modules like TensorFlow or CNTK in order to simplify tasks. A Keras layer comprises 3 main parts:
Input Layer - Which contains the raw data
Hidden layer - Where the nodes of a layer learn some aspects about
the raw data which is input. It's similar to levels of abstraction
to form a Neural network.
Output Layer - Consists of a single output which is mostly a single
node and can be subjected to classification.
Keras, as a whole consists of many different types of layers. A Convolutional layer creates a kernel which is convoluted with the input over a single temporal space to derive a group of outputs. Pooling layers provide sampling of the feature maps by simplifying features in a map into patches. Max Pooling and Average Pooling are commonly used methods in a Pool layer.
Other commonly used layers in Keras are Embedding layers, Noise layers and Core layers. A single NN layer can represent only a Linearly seperable method. Most prediction problems are complicated and more than just one layer is required. This is where Multi Layer concept is required.
I think i clear your doubts and for any other queries you can see on https://www.tensorflow.org/api_docs/python/tf/keras
Neural networks are a great tool nowadays to automate classification problems. However when it comes to computer vision the amount of input data is too great to be handled efficiently by simple neural networks.
To reduce the network workload, your data needs to be preprocessed and certain features need to be identified. To find features in images we can use certain filters (like sobel edge detection), which will highlight the essential features needed for classification.
Again the amount of filters required to classify one image is too great, and thus the selection of those filters needs to be automated.
That's where the convolutional layer comes in.
We use a convolutional layer to generate multiple random (at first) filters that will highlight certain features in an image. While the network is training those filters are optimized to do a better job at highlighting features.
In Tensorflow we use Conv2D() to add one of those layers. An example of parameters is : Conv2D(64, 3, activation='relu'). 64 denotes the number of filters used, 3 denotes the size of the filters (in this case 3x3) and activation='relu' denotes the activation function
After the convolutional layer we use a pooling layer to further highlight the features produced by the previous convolutional layer. In Tensorflow this is usually done with MaxPooling2D() which takes the filtered image and applies a 2x2 (by default) layer every 2 pixels. The filter applied by MaxPooling is basically looking for the maximum value in that 2x2 area and adds it in a new image.
We can use this set of convolutional layer and pooling layers multiple times to make the image easier for the network to work with.
After we are done with those layers, we need to pass the output to a conventional (Dense) neural network.
To do that, we first need to flatten the image data from a 2D Tensor(Matrix) to a 1D Tensor(Vector). This is done by calling the Flatten() method.
Finally we need to add our Dense layers which are used to train on the flattened data. We do this by calling Dense(). An example of parameters is Dense(64, activation='relu')
where 64 is the number of nodes we are using.
Here is an example CNN structure I used recently:
# Build model
model = tf.keras.models.Sequential()
# Convolution and pooling layers
model.add(tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, 1))) # Input layer
model.add(tf.keras.layers.MaxPooling2D())
model.add(tf.keras.layers.Conv2D(64, 3, activation='relu'))
model.add(tf.keras.layers.MaxPooling2D())
# Flattened layers
model.add(tf.keras.layers.Flatten())
# Dense layers
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(2, activation='softmax')) # Output layer
Of course this worked for a certain classification problem and the number of layers and method parameters differ depending on the problem.
The Youtube channel The Coding Train has a very helpful video explaining the Convolutional and Pooling layer.

Is batchnorm used in neural networks that are not CNN?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used in not-CNN. In NN. In networks with just fully-connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) If I have a CNN with an input image and an input array IN_array, the output is an array after the last fully-connected layer. I call this array FC_array. If I want to concat that FC_array with the IN_array.
CONCAT_array = tf.concat(values=[FC_array, IN_array])
Is it useful to have a bachnorm after the concat layer? Or should that batchnorm be just after the FC_array before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
TS;WM:
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read that article if you want to gain a deep understanding. but basically this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in the top-5 ImageNet classification, but any network where there are nonlinearities can benefit from batch normalization. Including a network consisting of fully-connected layers.
Yes, it is used for shallow CNN-s too. Any network with more than one layer can benefit from it, albeit it is true that more benefit comes to deeper networks.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the variance, thus creating a dataset with 0 mean and 1 variance. If you do this to a one-hot vector, then the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, just concatenates them. Batch normalization is done on the input of a layer, so the one after the concat, that will get the concatenated values, can do it if necessary.

When predicting with an LSTM in Keras, is the hidden state still adjusted?

When I first train an LSTM in Keras on sequence data - my training data -
and then use model.predict() to make predictions with my test data as input, is the hidden state of the LSTM still being adjusted?
Basic operation of a neural network is to take an input (vector) which is connected to the output with connections and, sometimes, other layers such as context layers. These connections are modelled as matrices and vary in strength, we call these weight matrices.
This means that the only thing we do when we are feeding data into the network is to put a vector into the network, multiply the values with the weight matrix and call that the output. In special cases, like recurrent networks, we even keep some values stored in other vectors and combine this stored value with the current input.
During training we not only feed data into the network, we also compute an error value that we evaluate in a clever way so that it tells us how we should change the weight matrices we multiply our inputs (and possibly past inputs for recurrent layers) with.
Therefore: yes, of course the basic execution behavior does not change for recurrent layers. We are just not updating weights anymore.
There are layers that do behave differently during execution time because they are treated as regularisers, i.e. methods that make training the network more efficient, which are deemed as unnecessary during execution. Examples for these layers are Noise and BatchNormalization. Almost all neural network layers (including recurrent ones) include drop-out which is another form of regularisation which disables a random percentage of connections in the layer. This is also only done during training.

pruning tensorflow connections and weights (using cifar10 cnn)

I'm using tensorflow to run a cnn for image classification.
I use tensorflow cifar10 cnn implementation.(tensorflow cifar10)
I want to decrease the number of connections, meaning I want to prune the low-weight connections.
How can I create a new graph(subgraph) without some of the nuerones?
Tensorflow does not allow you lock/freeze a particular kernel of a particular layer, that I have found. The only I've found to do this is to use the tf.assign() function as shown in
How to freeze/lock weights of one Tensorflow variable (e.g., one CNN kernel of one layer
It's fairly cave-man but I've seen no other solution that works. Essentially, you have to .assign() the values every so often as you iterate through the data. Since this approach is so inelegant and brute-force, it's very slow. I do the .assign() every 100 batches.
Someone please post a better solution and soon!
The cifar10 model you point to, and for that matter, most models written in TensorFlow, do not model the weights (and hence, connections) of individual neurons directly in the computation graph. For instance, for fully connected layers, all the connections between the two layers, say, with M neurons in the layer below, and 'N' neurons in the layer above, are modeled by one MxN weight matrix. If you wanted to completely remove a neuron and all of its outgoing connections from the layer below, you can simply slice out a (M-1)xN matrix by removing the relevant row, and multiply it with the corresponding M-1 activations of the neurons.
Another way is add an addition mask to control the connections.
The first step involves adding mask and threshold variables to the
layers that need to undergo pruning. The variable mask is the same
shape as the layer's weight tensor and determines which of the weights
participate in the forward execution of the graph.
There is a pruning implementation under tensorflow/contrib/model_pruning to prune the model. Hope this can help you to prune model quickly.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
I think google has an updated answer here : https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
Removing pruning nodes from the trained graph:
$ bazel build -c opt contrib/model_pruning:strip_pruning_vars
$ bazel-bin/contrib/model_pruning/strip_pruning_vars --checkpoint_path=/tmp/cifar10_train --output_node_names=softmax_linear/softmax_linear_2 --filename=cifar_pruned.pb
I suppose that cifar_pruned.pb will be smaller, since the pruned "or zero masked" variables are removed.