When predicting with an LSTM in Keras, is the hidden state still adjusted? - sequence

When I first train an LSTM in Keras on sequence data - my training data -
and then use model.predict() to make predictions with my test data as input, is the hidden state of the LSTM still being adjusted?

The basic operation of a neural network is to take an input (a vector) that is connected to the output through connections and, sometimes, other layers such as context layers. These connections are modelled as matrices and vary in strength; we call them weight matrices.
This means that the only thing we do when we are feeding data into the network is to put a vector into the network, multiply the values with the weight matrix and call that the output. In special cases, like recurrent networks, we even keep some values stored in other vectors and combine this stored value with the current input.
During training we not only feed data into the network, we also compute an error value that we evaluate in a clever way so that it tells us how we should change the weight matrices we multiply our inputs (and possibly past inputs for recurrent layers) with.
Therefore: yes. The forward pass of a recurrent layer works the same way at prediction time as during training, so the hidden state is still computed and updated at every time step of the input sequence. The only difference is that the weights are no longer updated.
Some layers do behave differently at prediction time because they act as regularisers, i.e. methods that help the network generalise during training and are unnecessary (or handled differently) at prediction time. Examples are noise layers (e.g. GaussianNoise) and BatchNormalization. Most layer types, including recurrent ones, can also be combined with dropout, another form of regularisation that disables a random fraction of units or connections in the layer. This, too, is only done during training.
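To make the distinction concrete, here is a minimal sketch in plain Python (not Keras itself) of a one-unit recurrent cell. The names `w_x`, `w_h` and `run_sequence`, as well as the weight values, are made up for illustration. The hidden state `h` changes at every time step of the forward pass, while the weights stay exactly as training left them.

```python
import math

# Hypothetical trained weights of a one-unit recurrent cell (illustrative values).
w_x, w_h = 0.5, 0.8

def run_sequence(inputs, h=0.0):
    """Forward pass only: returns the hidden state after each time step."""
    states = []
    for x in inputs:
        # The hidden state is recomputed from the current input and the previous state.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

weights_before = (w_x, w_h)
states = run_sequence([1.0, 0.0, -1.0])
weights_after = (w_x, w_h)

print(states)                           # the hidden state differs at each step
print(weights_before == weights_after)  # the weights are untouched by prediction
```

As far as I know, a Keras recurrent layer with the default `stateful=False` also starts every input sequence from a fresh initial state, so the state built up within one sequence is not carried over to the next one.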

Related

diagnosis on training process of neural network

I am training an autoencoder DNN for a regression question. Need suggestions on how to improve the training process.
The total number of training samples is about 100,000. I use Keras to fit the model, setting validation_split = 0.1. After training, I plotted the loss curves and got the following picture. As can be seen here, the validation loss is unstable and its mean values are very close to the training loss.
My question is: based on this, what is the next step I should try to improve the training process?
[Edit on 1/26/2019]
The details of network architecture are as follows:
It has 1 latent layer of 50 nodes. The input and output layers have 1000 nodes each. The activation of the hidden layer is ReLU. The loss function is MSE. For the optimizer, I use Adadelta with default parameter settings. I also tried setting lr=0.5, but got very similar results. The different features of the data have been scaled to between -10 and 10, with a mean of 0.
Judging by the graph provided, the network is not approximating the function that relates the input to the output.
If your features differ widely in scale, for example one of them is large while the others have very small values, you should normalize the feature vector. You can read more here.
For better training and testing results, you can follow these tips:
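As a rough illustration of normalizing features, the sketch below scales every column to zero mean and unit standard deviation using only the standard library. The function name `standardize_columns` and the toy data are assumptions made for this example, not part of the question.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Scale each feature (column) to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Guard against constant features, whose standard deviation is zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy data: the first feature is about 10,000 times larger than the second.
data = [[1000.0, 0.1], [2000.0, 0.2], [3000.0, 0.3]]
scaled = standardize_columns(data)
print(scaled)
```

In a real pipeline the means and standard deviations would be computed on the training set only and then reused for validation and test data.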
Use a small network. A network with one hidden layer is enough.
Perform activations in the input as well as hidden layers. The output layer must have a linear function. Use ReLU activation function.
Prefer small learning rate like 0.001. Use RMSProp optimizer. It works fine on most regression problems.
If you are not using mean squared error function, use it.
Try slow and steady learning and not fast learning.

Large trainable embedding layer slows down training

I am training a network to classify text with an LSTM. I use a randomly initialized, trainable embedding layer for the word inputs. The network is trained with the Adam optimizer and the words are fed into the network with a one-hot encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't the training only update the weights which were used during the prediction of the current data point? If my input sequence always has the same length, the same number of updates should always happen, regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall clock time, then it's an indication that simply performing the embedding lookup (and writing the update of embedding table) now takes a significant part of your training time.
That could easily be the case. 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is about 1.6 gigabytes of data that needs to be touched every minibatch. You also don't say how you're training this (CPU, GPU, which GPU), which has a meaningful impact on how this turns out because of, e.g., cache effects: on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have that would contain enough examples of the 2'000'000th most frequent word to calculate a meaningful embedding directly? It's probably possible, but it would be unusual: in pretty much all datasets, including very large ones, the 2'000'000th word would be a nonce word, so it would be harmful to include it in trainable embeddings. The usual approach is to compute large embeddings separately from large unlabeled data and use them as a fixed, untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
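The size estimate above is easy to reproduce. In this small sketch, `embedding_table_bytes` is a made-up helper, and the embedding length of 200 is the answer's guess rather than a figure from the question.

```python
def embedding_table_bytes(vocab_size, embedding_dim, bytes_per_value=4):
    """Memory taken by a dense embedding table (float32 by default)."""
    return vocab_size * embedding_dim * bytes_per_value

# embedding_dim=200 is the answer's guess, not a figure given in the question.
small = embedding_table_bytes(200_000, 200)
large = embedding_table_bytes(2_000_000, 200)
print(small / 1e9, "GB")  # about 0.16 GB
print(large / 1e9, "GB")  # about 1.6 GB
```

Adam additionally keeps two moment estimates per weight, so if the optimizer treats the table densely, its state adds roughly twice the table size on top of this.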
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look for example at the equation for the forget gate of the t-th cell (written here in one common notation),
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
you can see that the matrix of weights W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and it takes longer to train.
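The dependence on embedding_size can be checked with a quick count. This sketch assumes the standard LSTM layout (four weight blocks, each with an input kernel, a recurrent kernel and a bias); `lstm_param_count` and `units = 128` are made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer with biases.

    Four blocks (forget, input and output gates plus the candidate cell state),
    each with an input kernel, a recurrent kernel and a bias vector.
    """
    per_block = input_dim * units + units * units + units
    return 4 * per_block

units = 128  # hypothetical LSTM width
for embedding_size in (100, 200, 400):
    print(embedding_size, lstm_param_count(embedding_size, units))
```

Note that this count depends on embedding_size and the number of units only. The vocabulary size does not appear in it; it only affects the embedding table itself.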

Neural network immediately overfitting

I have a FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on # hidden units). (ReLU, Adam, MSE, same # hidden units per layer, tf.keras)
[Plots not shown: one for 32 neurons per hidden layer, one for 128 neurons per hidden layer]
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
Afaik it is better to have a too large network and try to regularize via L2-reg or dropout than to lower the network's capacity -- because a larger network will have more local minima, but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
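from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

n_neurons = 32  # hidden units per layer (also tried 128)
model = Sequential()
model.add(Dense(n_neurons, activation='relu'))  # hidden layer 1
model.add(Dense(n_neurons, activation='relu'))  # hidden layer 2
model.add(Dense(1, activation='linear'))        # single regression output
model.compile(optimizer='adam', loss='mse')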
Hyperparameter tuning is generally the hardest step in ML. In general we try different values randomly, evaluate the model, and choose the set of values that gives the best performance.
Getting back to your question: you have a high-variance problem (good in training, bad in testing).
There are eight things you can do, in order:
Make sure your test and training distribution are same.
Make sure you shuffle and then split the data into two sets (test and train)
A good train:test split will be 105:15K
Use a deeper network with Dropout/L2 regularization.
Increase your training set size.
Try Early Stopping
Change your loss function
Change the network architecture (Switch to ConvNets, LSTM etc).
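Of the items above, early stopping is the easiest to sketch without any framework. The function below is an illustrative stand-in, not Keras code; `early_stopping_epoch`, `patience` and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once the validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop training; keep the weights from best_epoch
    return best_epoch

# Made-up validation losses that bottom out early, as in the question.
val_losses = [0.90, 0.62, 0.55, 0.58, 0.63, 0.70, 0.81]
print(early_stopping_epoch(val_losses))
```

In tf.keras, the `EarlyStopping` callback (with its `monitor`, `patience` and `restore_best_weights` arguments) provides this behaviour during `model.fit`.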
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
"because a larger network will have more local minima"
Nope, this is not quite true. In reality, as the number of dimensions (parameters) increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima. It is very rare. For a point to be a local or global minimum, the derivatives across all dimensions of the working space must be zero and the loss must curve upwards in every direction. That is highly unlikely in a typical high-dimensional model, where most such points are saddle points instead.
One more thing: I noticed you are using a linear unit for the last layer. If your target can never be negative, I suggest trying ReLU there instead, since it may reduce the test/train error. If the target can be negative, keep the linear output, because ReLU cannot produce negative predictions.
Take this:
MSE = 1/2 * (y_true - y_prediction)^2
With a linear output, y_prediction can be negative. For a non-negative target, the MSE term grows as y_prediction becomes strongly negative (or far too positive).
Using ReLU for the last layer ensures that y_prediction is never negative, which rules out that source of error.
Let me try to substantiate some of the ideas here, with reference to the Deep Learning book by Ian Goodfellow et al., which is available for free online:
Chapter 7: Regularization. The most important point is data: one can and should avoid regularization if they have large amounts of data that approximate the distribution well. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Dataset Augmentation. With regard to data, Goodfellow discusses dataset augmentation and regularizing by injecting noise (most likely Gaussian) into the inputs, which has a comparable effect. This noise works well with regression tasks, as it keeps the model from latching onto a single feature and overfitting.
Section 7.8: Early Stopping. This is useful if you just want a model with the best test error. But again, this only works if your data allows the training set to say something about the test data. If there is an immediate increase in test error, the training would stop immediately.
Section 7.12: Dropout. Just applying dropout to a regression model doesn't necessarily help. In fact, "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model not to rely on single features, but in regression all inputs might be required to compute a value rather than to classify.
Chapter 11: Practical Methodology. This chapter emphasises the use of baseline models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
Bottom line is you can't just play with the model and hope for the best. Check the data, understand what is required and then apply the corresponding techniques. For more details read the book, it's very good. Your starting point should be a simple regression model, 1 layer, very few neurons and see what happens. Then incrementally experiment.
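As a minimal sketch of what such a starting point could look like, the code below compares two trivial baselines on made-up data: always predicting the mean, and a one-feature least-squares line. The helper names and the data are assumptions for illustration only.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(xs, ys):
    """Ordinary least squares for a single feature: y = slope * x + intercept."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

# Made-up toy data, roughly y = 2x + 1 with a little noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]

mean_baseline = mse(ys, [sum(ys) / len(ys)] * len(ys))
slope, intercept = fit_line(xs, ys)
linear_baseline = mse(ys, [slope * x + intercept for x in xs])
print(mean_baseline, linear_baseline)
```

A neural network is only worth tuning if it clearly beats numbers like these, ideally measured on held-out data rather than on the training set as done here.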

Is batchnorm used in neural networks that are not CNN?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used in non-CNN networks, i.e. in networks with just fully-connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) Suppose I have a CNN with an input image and an input array IN_array, and the output after the last fully-connected layer is an array that I call FC_array. I want to concat that FC_array with the IN_array:
CONCAT_array = tf.concat(values=[FC_array, IN_array], axis=-1)  # axis is required; -1 (last axis) is assumed here
Is it useful to have a batchnorm after the concat layer? Or should that batchnorm be just after the FC_array, before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
TS;WM:
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy in early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read that article if you want to gain a deep understanding, but basically this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in top-5 ImageNet classification, but any network with nonlinearities can benefit from batch normalization, including a network consisting only of fully-connected layers.
Yes, it is used for shallow CNNs too. Any network with more than one layer can benefit from it, although deeper networks benefit more.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the standard deviation, thus creating a dataset with mean 0 and variance 1. If you do this to a one-hot vector, the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, it just concatenates them. Batch normalization is applied to the input of a layer, so the layer after the concat, which receives the concatenated values, can apply it if necessary.
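To see what normalization does to one-hot vectors, here is a small sketch of the standardization step (subtract the batch mean, divide by the batch standard deviation) that batch normalization applies before its learned scale and shift. The function `batch_standardize` and the example batch are made up for illustration.

```python
import math

def batch_standardize(batch, eps=1e-5):
    """Standardize each column over the batch, as batch normalization does
    before applying its learned scale and shift."""
    n = len(batch)
    columns = list(zip(*batch))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return [
        [(v - m) / math.sqrt(var + eps) for v, m, var in zip(row, means, variances)]
        for row in batch
    ]

# A batch of four one-hot vectors over three classes.
one_hot_batch = [[1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1]]
normalized = batch_standardize(one_hot_batch)
for row in normalized:
    print([round(v, 2) for v in row])
```

After this step the entries are no longer 0 or 1 and the rows no longer sum to 1, so the vectors have lost their one-hot meaning.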

tensorflow - how to use variational recurrent dropout correctly

The tensorflow config dropout wrapper has three different dropout probabilities that can be set: input_keep_prob, output_keep_prob, state_keep_prob.
I want to use variational dropout for my LSTM units, by setting the variational_recurrent argument to true. However, I don't know which of the three dropout probabilities I have to use for variational dropout to function correctly.
Can someone provide help?
According to the paper https://arxiv.org/abs/1512.05287, on which the variational_recurrent dropout implementation is based, you can think about the parameters as follows:
input_keep_prob - probability of keeping (i.e. not dropping) input connections.
output_keep_prob - probability of keeping output connections.
state_keep_prob - probability of keeping recurrent (state) connections.
See the diagram below.
If you set variational_recurrent to true, you get an RNN similar to the model on the right; otherwise, one similar to the model on the left.
The basic differences between the two models are:
The variational RNN repeats the same dropout mask at each time step for inputs, outputs, and recurrent layers (it drops the same network units at each time step).
The naive RNN uses different dropout masks at each time step for the inputs and outputs alone (no dropout is used with the recurrent connections, since the use of different masks with these connections leads to deteriorated performance).
In the above diagram, coloured connections represent the dropped-out connections, with different colours corresponding to different dropout masks. Dashed lines correspond to standard connections with no dropout.
Therefore, if you use a variational RNN you can set all three probability parameters according to your requirement.
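The difference between the two masking schemes can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the TensorFlow implementation; the function names, sizes and seed are made up.

```python
import random

def sample_mask(size, keep_prob, rng):
    """1 keeps a unit, 0 drops it; each unit is kept with probability keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(size)]

def naive_masks(steps, size, keep_prob, rng):
    # A fresh mask is drawn at every time step.
    return [sample_mask(size, keep_prob, rng) for _ in range(steps)]

def variational_masks(steps, size, keep_prob, rng):
    # One mask is drawn per sequence and reused at every time step.
    mask = sample_mask(size, keep_prob, rng)
    return [mask for _ in range(steps)]

rng = random.Random(0)  # fixed seed so the run is repeatable
naive = naive_masks(steps=4, size=8, keep_prob=0.5, rng=rng)
variational = variational_masks(steps=4, size=8, keep_prob=0.5, rng=rng)
for step_mask in naive:
    print("naive      ", step_mask)
for step_mask in variational:
    print("variational", step_mask)
```

In the variational scheme, separate masks would be drawn for the inputs, the outputs and the recurrent state, each governed by its own keep probability and each held fixed across the time steps of a sequence.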
Hope this helps.