LSTM layer output size vs. hidden state size in KERAS - tensorflow

I am having trouble understanding the concept of an LSTM and using it in Keras. When considering an LSTM layer, there should be two values: the output size and the hidden state size.
1. hidden state size: how many features are passed across the time steps of a sample when training the model
2. output size: how many outputs should be returned by a particular LSTM layer
But in keras.layers.LSTM there is only one parameter, and it is used to control the output size of the layer.
PROBLEM:
How, then, can the hidden state size of the LSTM layer be changed?
If I have misunderstood something, corrections are really appreciated.

You are confusing the hidden units and the output units of an LSTM. Please refer to the link below for more clarity:
https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/
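Basically, what you provide in num_units (units in keras.layers.LSTM) is the size of the LSTM hidden state. Without a projection, the output of the layer is the hidden state itself, so the same value also sets the output size and there is no second parameter to change. This is very clear from the article.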
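As a rough illustration of why one parameter is enough, the sketch below computes the shapes an LSTM layer produces from a single units value. The helper lstm_shapes is hypothetical plain Python, not part of Keras, and it assumes the standard case without a projection, where the output at each step is the hidden state.

```python
def lstm_shapes(batch, timesteps, units, return_sequences=False):
    """Shapes produced by an LSTM layer without projection (illustrative helper)."""
    hidden_state = (batch, units)  # h: also what the layer emits at each step
    cell_state = (batch, units)    # c: internal memory, same width as h
    if return_sequences:
        output = (batch, timesteps, units)  # one h per time step
    else:
        output = hidden_state               # only the last h
    return output, hidden_state, cell_state


out, h, c = lstm_shapes(batch=32, timesteps=10, units=64)
print(out, h, c)
```

Changing units changes all three shapes together, which is why the output size and the hidden state size cannot be set independently here.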

Related

Last fc layers in VGG16

The VGG16 architecture takes 224x224x3 images as input. I want to use 48x48x3 inputs, but to do this in Keras we remove the last FC layers, which have 4096 neurons each. Why do we have to do this? And do I need to add FC layers of another size for this input?
The final pooling layer of VGG16 has dimension 7x7x512 for a 224x224 input image. From there, VGG16 uses a fully connected layer of (7x7x512)x4096 to get a 4096-dimensional output. However, since your input size is different, the feature output of the final pooling layer will also have a different dimension (1x1x512 for a 48x48 input, since each of the five pooling stages halves the spatial size, rounding down). So you need to change the matrix dimension of the fully connected layer to make it work. You have two other options, though:
1. Use global average pooling across the spatial dimensions to get a 512-dimensional feature, and then use a few fully connected layers to get to your number of classes.
2. Resize your input image to 224x224x3, and you won't need to change anything in the model architecture.
Removing the last FC layers is for fine-tuning or transfer learning, where you adapt an existing network to a new problem, such as changing the number of categories that your classifier can choose between.
You are adapting the network to take a different sized input, so you need to adjust the layers whose weight shapes depend on the input size, which in VGG16 means the first fully connected layer(s).
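To make the size mismatch concrete, here is a minimal sketch of the arithmetic. It assumes that VGG16's convolutions preserve the spatial size and that each of its five 2x2 max-pooling stages halves height and width, rounding down. The function names are made up for illustration.

```python
def vgg16_feature_shape(height, width, num_pools=5, channels=512):
    """Shape after the last pooling layer, assuming size-preserving convolutions
    and 2x2 max pooling with stride 2 (each stage halves the size, rounded down)."""
    for _ in range(num_pools):
        height //= 2
        width //= 2
    return height, width, channels


def flattened_size(shape):
    """Number of inputs the first fully connected layer would receive."""
    h, w, c = shape
    return h * w * c


for side in (224, 48):
    shape = vgg16_feature_shape(side, side)
    print(side, shape, flattened_size(shape))
```

Under these assumptions the first FC layer sees 25088 inputs for a 224x224 image but only 512 for a 48x48 image, so weights of the original 25088x4096 layer cannot be reused as they are.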

Setting initial state in dynamic RNN

Based on the link:
https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn
In the examples there, the "initial state" is set in the first example but not in the second. Could anyone please explain what the purpose of the initial state is? What's the difference if I don't set it vs. if I set it? Is it only required in a single RNN cell and not in a stacked cell like in the example provided in the link?
I'm currently debugging my RNN model, as it seemed to classify different questions in the same category, which is strange. I suspect that it might have to do with me not setting the initial state of the cell.
Could anyone please explain what is the purpose of initial state?
The state is the vector of hidden activations that the cell passes from one time step to the next (for example from time step 1 to time step 2). It is not a weight matrix: the recurrent weights transform it, while the state itself holds the temporal information accumulated from the previous time steps.
Providing a state through the initial_state= argument gives the RNN cell a memory of previous activations to start from, instead of an empty one.
What's the difference if I don't set it vs if I set it?
If we pass a state that was produced earlier, for example the final state from a previous batch or from another model, we are restoring the memory of the RNN cell so that it does not have to start from scratch.
In the TF docs, they initialize initial_state with the cell's zero_state.
If you don't set initial_state, dynamic_rnn starts from an all-zero state (which is why the dtype argument is required in that case). The state is not trained like the weight matrices; it is recomputed for every sequence.
Is it only required in a single RNN cell and not in a stacked cell like in the example provided in the link?
I don't know exactly why they haven't set initial_state in the stacked RNN example, but every type of RNN has an initial state, since the state is what preserves the temporal features across time steps.
Maybe the stacked RNN was the point of interest in the docs, and not the setting of initial_state.
Tip:
In most cases, you will not need to set the initial_state for an RNN. TensorFlow can handle this efficiently for us. In a seq2seq RNN this argument may be used, for example to start the decoder from the encoder's final state.
Your RNN may be facing some other issue. An RNN builds up its own memory as it reads the sequence and doesn't need a head start.
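The effect of an initial state can be sketched with a toy scalar RNN in plain Python. This is not TensorFlow code; the weights w_in and w_rec are arbitrary values chosen for illustration, and the default state of 0.0 is meant to resemble starting from a zero state.

```python
import math


def run_rnn(inputs, w_in=0.5, w_rec=0.8, initial_state=0.0):
    """Run a one-unit tanh RNN over the inputs and return the final state."""
    state = initial_state
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
    return state


sequence = [1.0, -0.5, 0.25, 2.0]

full = run_rnn(sequence)                                   # whole sequence, zero start
first_half = run_rnn(sequence[:2])                         # state after two steps
resumed = run_rnn(sequence[2:], initial_state=first_half)  # continue from that state
fresh = run_rnn(sequence[2:])                              # same inputs, zero start

print(full, resumed, fresh)
```

Resuming from the saved state reproduces the result of processing the whole sequence, while starting the second half from zero gives a different final state. The weights are identical in all four calls; only the starting memory differs.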

TensorFlow's RNN units and cells

The constructor for many of the RNN classes (BasicRNNCell, LSTMCell, and so on) accepts an argument named num_units. This sets the number of units in the cell.
I thought this identified the number of elements the RNN should process in sequence. So if you want an RNN to process sequences of length N, you'd have N units per cell. Is this correct? What exactly is an RNN unit?
No, it's not correct.
num_units refers to the number of features your cells can represent. At each time step, you give an input of a certain size (that you are calling "the number of elements the RNN should process in sequence"). This is like the layer 0 of your neural network. This input is then processed into a hidden layer, with size num_units. This is also the size of the cell output.
What you call N, is set by the size of your inputs tensor. num_units is a hyperparameter of your model. The bigger it is, the more degrees of freedom your model has (more descriptive features).
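The distinction between sequence length and num_units can be seen in a small plain-Python sketch of a tanh RNN cell. The helper names (make_cell, step, run) and the random weights are assumptions made for illustration, not TensorFlow's implementation.

```python
import math
import random


def make_cell(input_size, num_units, seed=0):
    """Weights depend only on input_size and num_units, never on sequence length."""
    rng = random.Random(seed)
    w_x = [[rng.uniform(-0.1, 0.1) for _ in range(num_units)] for _ in range(input_size)]
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(num_units)] for _ in range(num_units)]
    return w_x, w_h


def step(cell, x, h):
    """One time step: combine the input and the previous state into a new state."""
    w_x, w_h = cell
    num_units = len(w_h)
    return [
        math.tanh(
            sum(x[i] * w_x[i][j] for i in range(len(x)))
            + sum(h[k] * w_h[k][j] for k in range(num_units))
        )
        for j in range(num_units)
    ]


def run(cell, sequence):
    """Apply the same cell at every time step, starting from a zero state."""
    h = [0.0] * len(cell[1])
    for x in sequence:
        h = step(cell, x, h)
    return h


cell = make_cell(input_size=4, num_units=8)
short_seq = [[1.0, 0.0, 0.0, 0.0]] * 3  # N = 3 time steps
long_seq = [[0.0, 1.0, 0.0, 0.0]] * 7   # N = 7 time steps
print(len(run(cell, short_seq)), len(run(cell, long_seq)))
```

The same cell handles both sequences, and the state it returns always has num_units entries regardless of how many time steps were processed.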
Here num_units refers to the number of units in the LSTM (or RNN) cell.
num_units can be interpreted as the analogue of a hidden layer in a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward neural network is equivalent to the num_units LSTM units in an LSTM cell at every time step of the network. The figure in the cited article should clear up any confusion
(cited from https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/)

What does size of the GRU or LSTM cell in the TensorFlow seq2seq tutorial represent?

I'm working with the seq2seq model in the TensorFlow tutorials, and I'm having trouble understanding some of the details. One thing that is confusing to me is what the "size" of a cell represents. I think I have a high level understanding of images like
I believe this is showing that the output from the last step in the encoder is the input to the first step in the decoder. In this case each box is the GRU or LSTM cell at a different time-step in the sequence.
I also think I understand, at a superficial level, diagrams like this:
from colah's blog post about LSTM and GRU cells. My understanding is that a "cell" is a neural network that feeds the output from one step back into itself along with the new input for the subsequent step. The gates control how much it "remembers" and "forgets."
I think I am getting confused at the level between this superficial, high level understanding and the low-level details. It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct? If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size? Is that what the embedding wrapper does?
Most importantly, what effect would increasing or decreasing the size of the cell have? Would a larger cell be better at learning embeddings? Or at predicting outputs? Both?
It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct?
The size of the cell is the size of the RNN state vector h. In the case of LSTM it's also the size of c. It's not "the number of nodes" (I'm not sure what you mean by nodes).
If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size?
The input size for the model is independent of the state size. The two vectors (input and state) are concatenated and multiplied by a matrix of shape [state_size + input_size, state_size] to get the next state (simplified version).
Is that what the embedding wrapper does?
No, the embedding is the result of multiplying the one-hot input vector by a matrix of shape [vocab_size, input_size]. This happens before the input is concatenated with the state and multiplied by the state matrix.
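The two multiplications can be sketched in plain Python with toy sizes standing in for the tutorial's 40,000 and 1024. All names and values here are illustrative assumptions; the point is only how the shapes fit together.

```python
def one_hot(index, vocab_size):
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    cols = len(mat[0])
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(cols)]


vocab_size, input_size, state_size = 5, 3, 4  # toy sizes

# Embedding matrix of shape [vocab_size, input_size]; values chosen to be easy to read.
embedding = [[float(10 * r + c) for c in range(input_size)] for r in range(vocab_size)]

token = 2
embedded = vec_mat(one_hot(token, vocab_size), embedding)
print(embedded)  # same values as embedding[token]

# The cell then works on the concatenation of state and embedded input.
state = [0.0] * state_size
concatenated = state + embedded
weights_shape = (state_size + input_size, state_size)
print(len(concatenated), weights_shape)
```

Multiplying a one-hot vector by the embedding matrix just selects one row, so the vocabulary size only affects the embedding matrix and never the state update, which depends on input_size and state_size.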

What is num_units in tensorflow BasicLSTMCell?

In MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary-layer formed when you represent an unrolled RNN over time?
Why is the num_units = 128 in most cases ?
From this brilliant article
num_units can be interpreted as the analogy of hidden layer from the feed forward neural network. The number of nodes in hidden layer of a feed forward neural network is equivalent to num_units number of LSTM units in a LSTM cell at every time step of the network.
See the image there too!
The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
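To get a feel for how the number of hidden units drives the number of learned parameters, the sketch below counts the weights and biases of a standard LSTM cell with four gates and no peepholes or projection. The input size of 28 is an assumption that corresponds to feeding one MNIST image row per time step.

```python
def lstm_param_count(input_size, num_units):
    """Weights and biases of a standard LSTM cell (4 gates, no peepholes, no projection)."""
    per_gate = (input_size + num_units) * num_units + num_units
    return 4 * per_gate


for units in (32, 64, 128, 256):
    print(units, lstm_param_count(28, units))
```

Because num_units appears squared in the recurrent weights, doubling it roughly quadruples the parameter count once num_units is large compared with the input size.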
The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of its own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
A good answer to a similar question here.
You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.
The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input with a matrix of shape [10, n_hidden], and get a n_hidden vector.
Your LSTM gets at each timestep t:
the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
the input, transformed to size n_hidden
It then sums these inputs and produces the next hidden state h_t of size n_hidden.
From Colah's blog post:
If you just want to have code working, just keep with n_hidden = 128 and you will be fine.
An LSTM keeps two pieces of information as it propagates through time:
A hidden state, which is the memory the LSTM accumulates through time using its forget, input, and output gates (many texts call this the cell state c), and
The previous time-step output (which many texts call the hidden state h).
Tensorflow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.
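The two pieces of information can be made concrete with a plain-Python sketch of one LSTM step. It follows the textbook LSTM equations and uses the common names c for the accumulated memory and h for the per-step output. The weight initialisation and helper names are assumptions for illustration, not TensorFlow's implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_lstm(input_size, num_units, seed=0):
    """One weight matrix and bias per gate: input (i), forget (f), output (o), candidate (g)."""
    rng = random.Random(seed)
    rows = input_size + num_units
    return {
        gate: (
            [[rng.uniform(-0.1, 0.1) for _ in range(num_units)] for _ in range(rows)],
            [0.0] * num_units,
        )
        for gate in "ifog"
    }


def lstm_step(params, x, h_prev, c_prev):
    xh = x + h_prev  # concatenate the input with the previous output

    def gate(name, activation):
        weights, bias = params[name]
        return [
            activation(sum(v * row[j] for v, row in zip(xh, weights)) + bias[j])
            for j in range(len(bias))
        ]

    i = gate("i", sigmoid)
    f = gate("f", sigmoid)
    o = gate("o", sigmoid)
    g = gate("g", math.tanh)
    # New memory: keep part of the old memory, add part of the candidate values.
    c = [f_j * c_j + i_j * g_j for f_j, c_j, i_j, g_j in zip(f, c_prev, i, g)]
    # New output: a gated view of the memory.
    h = [o_j * math.tanh(c_j) for o_j, c_j in zip(o, c)]
    return h, c


num_units = 6
params = make_lstm(input_size=3, num_units=num_units)
h, c = [0.0] * num_units, [0.0] * num_units
for x in ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]):
    h, c = lstm_step(params, x, h, c)
print(len(h), len(c))
```

Both vectors carried between steps have exactly num_units entries, whatever the input size is.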
Look at this awesome post for more clarity
Since I had some problems combining the information from the different sources, I created the graphic below, which combines the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/), where I think the graphics are very helpful but the explanation of number_units contains an error.
Several LSTM cells form one LSTM layer, as shown in the figure below. Since the data you deal with is usually very large, it is not possible to feed everything into the model in one piece. The data is therefore divided into small pieces (batches), which are processed one after the other until the batch containing the last part has been read in.
In the lower part of the figure you can see the input (dark grey), where the batches are read in one after the other from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the cells described in the LSTM blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps. For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 (batch_size) and have a sequence of length 50 per batch (the number of time_steps and thus of LSTM cells).
If you then one-hot encoded each character, each element (dark gray boxes of the input) would be a vector with the length of the vocabulary (number of features). These vectors flow into the neural networks (green elements in the cells) of the respective cells, which change their dimension to the number of hidden units (number_units). So the input has the dimension (batch_size x time_step x features). The long-term memory (cell state) and the short-term memory (hidden state) have the same dimensions (batch_size x number_units). The light gray blocks that come out of the cells have a different dimension, (batch_size x time_step x number_units), because of the transformations in the neural networks (green elements) with the hidden units. The output can be returned from any cell, but mostly only the information from the last block (black border) is relevant (not in all problems), because it contains all the information from the previous time steps.
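The 150-character example can be reproduced in a few lines of plain Python. The helper make_batch and the five-letter toy text are made up for illustration. The sketch only builds the input tensor as nested lists and reports its shape.

```python
def make_batch(text, time_steps, vocabulary):
    """Split text into sequences of time_steps characters and one-hot encode them."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    sequences = [text[i:i + time_steps] for i in range(0, len(text), time_steps)]
    batch = []
    for seq in sequences:
        rows = []
        for ch in seq:
            one_hot = [0] * len(vocabulary)
            one_hot[index[ch]] = 1
            rows.append(one_hot)
        batch.append(rows)
    return batch


text = "abcde" * 30  # 150 characters
vocabulary = sorted(set(text))
batch = make_batch(text, time_steps=50, vocabulary=vocabulary)

shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (batch_size, time_step, features)
```

With this input, an LSTM layer with number_units hidden units would carry states of shape (3, number_units) and, if it returns every time step, produce outputs of shape (3, 50, number_units).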
I think the term "num_hidden" is confusing for TF users. Actually it has nothing to do with the unrolled LSTM cells; it is just the dimension of the tensor that the time-step input tensor is transformed into and that is fed into the LSTM cell.
The term num_units or num_hidden_units, sometimes written with the variable name nhid in implementations, means that the input to the LSTM cell is a vector of dimension nhid (or, for a batched implementation, a matrix of shape batch_size x nhid). As a result, the output (from the LSTM cell) would also be of the same dimensionality, since an RNN/LSTM/GRU cell doesn't alter the dimensionality of the input vector or matrix.
As pointed out earlier, this term was borrowed from Feed-Forward Neural Networks (FFNs) literature and has caused confusion when used in the context of RNNs. But, the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed be containing num_hidden units as depicted in this figure:
Source: Understanding LSTM
More concretely, in the below example the num_hidden_units or nhid would be 3 since the size of hidden state (middle layer) is a 3D vector.
I think this is a correct answer to your question. LSTMs always cause confusion.
You can refer to this blog post for more detail: Animated RNN, LSTM and GRU
Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells. Hence, the confusion.
Each hidden layer has hidden cells, as many as the number of time steps.
And further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in RNN is (number of time steps, number of hidden units).
The concept of a hidden unit is illustrated in this image: https://imgur.com/Fjx4Zuo.
Following @SangLe's answer, I made a picture (see sources for the original pictures) showing cells as classically represented in tutorials (Source1: Colah's Blog) and an equivalent cell with 2 units (Source2: Raimi Karim's post). I hope it clarifies the confusion between cells and units and shows what the network architecture really is.