I have created my own LSTM neural network in VB.NET. From what I've read, LSTM networks are not supposed to suffer from exploding/vanishing gradients. However, after a while all of the gradients increase to the maximum value, and changing the learning rate only changes how long this takes to happen. Is there anything that can cause exploding gradients in an LSTM network?
I'm using RMSProp with momentum to update the weights, with a sequence length ranging from 32 to 64. The network also includes peephole connections, and the training data is in the range [0, 1].
I based it on the paper LSTM: A Search Space Odyssey.
I had the same problem with an LSTM in PyTorch. Clipping the gradients helped.
Additionally, you could try changing the learning rate.
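A minimal PyTorch sketch of what I mean; the LSTM sizes, the RMSProp settings and the clip threshold of 1.0 are placeholders, not values from your network:

import torch
import torch.nn as nn

# Toy LSTM; all sizes and hyperparameters here are placeholders.
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3, momentum=0.9)

x = torch.randn(4, 32, 8)        # (batch, sequence length, features)
target = torch.randn(4, 32, 16)  # dummy targets matching the hidden size

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale the gradients if their global norm exceeds 1.0, then update the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()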
I have data to train and test using a fully-connected deep neural network (FC-DNN). The input size is almost 3000, the first hidden layer should be up to 4096, the third layer 4096, and the output layer should be 3000.
My question is: is this size of deep neural network reasonable and acceptable? What is the maximum reasonable size of a deep neural network?
There is no maximum reasonable size (neither for neurons per layer nor for the number of layers). After a certain point (which really depends on the problem you are trying to solve), you get diminishing returns from stacking more Dense layers. In fact, it can lead to overfitting, which should be avoided. At the same time, in the absence of residual connections, stacking many Dense layers (making the network super-deep) can also lead to the vanishing gradient problem.
You should try adding a few layers manually, and only if you see that your network does not perform well (low accuracy or another metric relevant to your problem, which is a sign of underfitting) should you add more.
Also, I do not think your problem requires 3000 neurons in the last layer. If it is regression, one neuron with a linear activation will suffice; 3000 neurons are only needed if you have 3000 different classes (speaking here only about regression and classification).
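As an illustration, a minimal Keras sketch along these lines, assuming a regression target; only the 3000/4096 sizes come from your question, everything else is a placeholder:

import tensorflow as tf

# ~3000 input features, two 4096-unit hidden layers, one linear output for regression.
# Start small and add layers only if the validation metric indicates underfitting.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3000,)),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(4096, activation='relu'),
    tf.keras.layers.Dense(1, activation='linear'),  # 3000 outputs only if you have 3000 classes
])
model.compile(optimizer='adam', loss='mse')
model.summary()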
I am training a neural network for regression with TensorFlow, and getting strange behaviour on my loss curves. The task is to predict the motion of an object in an image, when an action is applied to the object. So the network takes in an image, and an action, and outputs the motion.
The image input is followed by three CNN layers, and in parallel, the action input is followed by a dense layer. These are then concatenated, and followed by two dense layers, before the output. All layers have ReLUs. Data is normalised to have zero mean and standard deviation of one.
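For concreteness, a rough Keras sketch of this architecture; the image size, filter counts and layer widths here are placeholders, not my actual values:

import tensorflow as tf

image_in  = tf.keras.Input(shape=(64, 64, 3), name='image')   # image of the object
action_in = tf.keras.Input(shape=(4,), name='action')         # applied action

x = tf.keras.layers.Conv2D(32, 3, activation='relu')(image_in)
x = tf.keras.layers.Conv2D(32, 3, activation='relu')(x)
x = tf.keras.layers.Conv2D(32, 3, activation='relu')(x)
x = tf.keras.layers.Flatten()(x)

a = tf.keras.layers.Dense(64, activation='relu')(action_in)

h = tf.keras.layers.Concatenate()([x, a])
h = tf.keras.layers.Dense(128, activation='relu')(h)
h = tf.keras.layers.Dropout(0.5)(h)
h = tf.keras.layers.Dense(128, activation='relu')(h)
h = tf.keras.layers.Dropout(0.5)(h)
motion_out = tf.keras.layers.Dense(2, name='motion')(h)       # predicted motion

model = tf.keras.Model([image_in, action_in], motion_out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss='mse')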
Below is the training curve:
The strange behaviour is that, whilst the training loss decreases over time, the validation loss increases right from the start. Usually, overfitting is diagnosed when the training curve drops far below the validation curve and the validation curve first decreases and then rises again. In my case, however, the validation curve never decreases at all.
Instead, it is as if the network is overfitting from the very first epoch. In fact, the validation curve seems to follow the opposite trajectory to the training curve: every improvement in the training prediction produces the opposite effect on the validation prediction.
I have also tried varying the step size (I am using Adam, and in this graph, the step size is 0.0001, then reduces to 0.00001 at epoch 100). My network uses dropout on all the dense layers. I have also tried reducing the number of parameters in the network to prevent overfitting, but the same behaviour occurs. I have a batch size of 50.
What could be the diagnosis of this behaviour? Is the network overfitting, or is it something else? If it is overfitting, then why do my attempts to reduce the number of parameters and to add dropout still result in the same effect? And why does the overfitting occur immediately, without the validation loss ever decreasing?
Thank you!
I am working on a deep learning (CNN + AEs) approach on facial images.
I have:
an input layer of 112*112*3 facial images
3 convolution + max pooling + ReLU layers
2 fully connected layers with 512 neurons each, with 50% dropout to avoid overfitting
a final output layer with 10 neurons, since I have 10 classes
I also used the reduce-mean of softmax cross-entropy as the loss, along with L2 regularization.
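Roughly, the architecture looks like this Keras sketch; the filter counts, kernel sizes and L2 factor here are placeholders rather than my actual values:

import tensorflow as tf

l2 = tf.keras.regularizers.l2(1e-4)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(112, 112, 3)),
    tf.keras.layers.Conv2D(32, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu', kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(512, activation='relu', kernel_regularizer=l2),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10),  # logits; softmax cross-entropy is applied in the loss
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])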
For training, I divided my dataset into 3 groups:
60% for training
20% for validation
20% for evaluation
The problem is that after a few epochs the validation error rate stays at a fixed value and never changes. I used TensorFlow to implement my project.
I haven't had such a problem with CNNs before, so I think this is the first time. I have checked the code; it's based on the TensorFlow documentation, so I don't think the problem is with the code. Maybe I need to change some parameters, but I am not sure.
Any ideas about common solutions for such a problem?
Update:
I changed the optimizer from momentum to Adam with the default learning rate. Now the validation error does change, but it is lower than the mini-batch error most of the time, even though both use the same batch size.
I have tested the model with and without biases (initialised to 0.1), but there is no good fit yet.
Update:
I fixed the issue. I will update with more details soon.
One common solution that I found helpful for this type of problem is using TensorBoard. You can visualize training performance information after each epoch for different points in the computational graph. Adding key metrics is worth it, since you can see how training progresses after applying changes to the learning rate, batch size, neural network architecture, dropout/regularization, number of GPUs, etc.
Here is the link that I found helpful to add these details:
https://www.tensorflow.org/how_tos/graph_viz/#runtime_statistics
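A self-contained TF 1.x-style sketch of writing scalar summaries that TensorBoard can plot; the toy graph and the ./logs directory are just placeholders:

import tensorflow as tf  # written against the TF 1.x API that the linked guide describes

# Hypothetical toy graph: a single trainable scalar with a quadratic loss.
x = tf.Variable(5.0)
loss = tf.square(x - 2.0)
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Scalar summaries appear as curves in TensorBoard's "Scalars" tab.
tf.summary.scalar('loss', loss)
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('./logs', sess.graph)  # also writes the graph for graph_viz
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        summary, _ = sess.run([merged, train_op])
        writer.add_summary(summary, step)   # then run: tensorboard --logdir ./logs
    writer.close()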
I want to use a ReLU activation for my simple RNN in a TensorFlow model I am building. It sits on top of a deep convolutional network, and I am trying to classify a sequence of images. I noticed that the default activation in both the Keras and TensorFlow source code is tanh for simple RNNs. Is there a reason for this? Is there anything wrong with using ReLU? It seems like ReLU would help better with vanishing gradients.
nn = tf.nn.rnn_cell.BasicRNNCell(1024, activation=tf.nn.relu)
RNNs can suffer from both the exploding gradient and the vanishing gradient problem. When the sequence to learn is long, this is a very delicate balance that can tip into one or the other quite easily. Both problems are caused by exponentiation: each layer of the unrolled network multiplies the gradient by the weight matrix and the derivative of the activation, so if either the matrix magnitude or the activation derivative differs from 1.0, there will be a tendency towards exploding or vanishing gradients.
ReLUs do not help with exploding gradient problems. In fact, they can be worse than activation functions whose outputs are naturally bounded when weights are large, such as sigmoid or tanh.
ReLUs do help with vanishing gradient problems. However, the designs of LSTM and GRU cells are also intended to address the same problem (of dealing with learning from potentially weak signals many time steps away), and do so very effectively.
For a simple RNN with short time series, there should be nothing wrong with using a ReLU activation. To address the possibility of exploding gradients during training, you could look at gradient clipping (treating gradients outside of an allowed range as the min or max of that range).
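For example, a sketch of value clipping with the TF 1.x optimizer API; the toy loss and the ±1.0 range are placeholders, so plug in your own loss and tune the range:

import tensorflow as tf  # TF 1.x style API

# Toy model so the snippet is self-contained; substitute your own loss tensor.
x = tf.Variable(tf.random_normal([1024]))
loss = tf.reduce_sum(tf.square(x))

optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
# Clamp each gradient element into [-1.0, 1.0] before applying the update.
clipped = [(tf.clip_by_value(g, -1.0, 1.0), v)
           for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)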
I can see two reasons:
LSTMs (and the underlying RNN blocks) have always been defined in the literature using the tanh activation function, and that is what most users will expect from the implementation.
If I recall correctly, tanh works better than ReLU for recurrent networks, but I can't find the paper/resource this memory comes from.
You are encouraged to experiment on your particular dataset/problem to see which of the two activation functions performs best.
I am working on a project to localize an object in an image. The method I am going to adopt is based on the localization algorithm in CS231n-8.
The network structure has two optimization heads, a classification head and a regression head. How can I minimize both of them when training the network?
One idea I have is to sum both of them into a single loss. But the problem is that the classification loss is a softmax loss and the regression loss is an L2 loss, which means they have different ranges. I don't think this is the best way.
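Roughly, the idea I have in mind looks like this (TF 1.x style; all tensor shapes and the weighting value are placeholders):

import tensorflow as tf

class_logits = tf.placeholder(tf.float32, [None, 20])  # output of the classification head
class_labels = tf.placeholder(tf.float32, [None, 20])  # one-hot class labels
box_pred     = tf.placeholder(tf.float32, [None, 4])   # output of the regression head
box_target   = tf.placeholder(tf.float32, [None, 4])   # ground-truth box coordinates

cls_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=class_labels, logits=class_logits))
reg_loss = tf.reduce_mean(tf.square(box_pred - box_target))  # L2 loss

reg_weight = 1.0  # hyperparameter that balances the two ranges; tune on validation data
total_loss = cls_loss + reg_weight * reg_loss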
It depends on your network status.
If your network is only used to extract features (you're keeping weights from some other net), you can set those weights to be constants and then train the two heads separately, since the gradient will not flow through the constants.
If you're not using weights from a pre-trained model, you have to:
Train the network to extract features: train the network using the classification head and let the gradient flow from the classification head down to the first convolutional filter. In this way your network learns to classify objects by combining the extracted features.
Convert the learned weights of the convolutional filters and the classification head into constant tensors, and then train the regression head.
The regression head will learn to combine the features extracted by the convolutional layers, adapting its parameters in order to minimize the L2 loss.
Tl;dr:
Train the network for classification first.
Convert every learned parameter to a constant tensor, using graph_util.convert_variables_to_constants as shown in the `freeze_graph` script.
Train the regression head.
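A minimal TF 1.x sketch of that recipe; the layer sizes and the 'feat_out' tensor name are hypothetical, and only the freeze-then-train-regression structure is the point:

import tensorflow as tf
from tensorflow.python.framework import graph_util  # TF 1.x

# Toy stand-in for the shared feature extractor plus classification head.
x = tf.placeholder(tf.float32, [None, 64], name='input')
features = tf.identity(
    tf.layers.dense(x, 32, activation=tf.nn.relu), name='feat_out')
class_logits = tf.layers.dense(features, 10, name='class_logits')

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... train the classification head here ...

    # Freeze everything learned so far into constant tensors.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=['feat_out'])

# Re-import the frozen graph and attach a trainable regression head on top of the
# now-constant features; only the regression head's weights will receive gradients.
graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(frozen, name='frozen')
    frozen_features = graph.get_tensor_by_name('frozen/feat_out:0')
    box_pred = tf.layers.dense(frozen_features, 4, name='regression_head')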