What is num_units in tensorflow BasicLSTMCell? - tensorflow

In MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary layer formed when you represent an unrolled RNN over time?
Why is num_units = 128 in most cases?

From this brilliant article
num_units can be interpreted as the analogue of the hidden layer of a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward neural network is equivalent to the number of LSTM units (num_units) in an LSTM cell at every time step of the network.
See the image there too!

The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifested primarily in the size of the weights array. And because an LSTM actually does have a bit of its own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
A good answer to a similar question here.
You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
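For a quick way to see this in practice, here is a minimal sketch (assuming the TF 1.x API, tf.nn.rnn_cell and tf.nn.dynamic_rnn); the numbers are the usual MNIST-example values and are only illustrative:
import tensorflow as tf

num_units = 128
inputs = tf.placeholder(tf.float32, [None, 28, 28])   # 28 time steps of 28 pixels each

cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

print(outputs.shape)        # (?, 28, 128): one 128-dimensional vector per time step
for v in cell.variables:    # kernel should be (28 + 128, 4 * 128), bias (4 * 128,)
    print(v.name, v.shape)
The kernel shape is where the number of hidden units and the input size have to agree, which is exactly the "size of the weights" point made above.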
Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.

The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input with a matrix of shape [10, n_hidden], and get an n_hidden-dimensional vector.
Your LSTM gets at each timestep t:
the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
the input, transformed to size n_hidden
it will sum these inputs and produce the next hidden state h_t of size n_hidden
From Colah's blog post:
If you just want to have code working, just keep with n_hidden = 128 and you will be fine.
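If it helps, here is a minimal sketch of a single timestep of that loop, assuming the TF 1.x API (the variable names and batch size are only illustrative):
import tensorflow as tf

n_hidden = 128
batch_size = 32

cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden)
x_t = tf.placeholder(tf.float32, [batch_size, 10])    # input at timestep t, size 10
state = cell.zero_state(batch_size, tf.float32)       # (c, h); at t=0 everything is zeros

h_t, state = cell(x_t, state)                          # h_t has shape (32, 128)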

An LSTM keeps two pieces of information as it propagates through time:
A hidden state, which is the memory the LSTM accumulates using its (forget, input, and output) gates through time, and
The previous time-step output.
Tensorflow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.
Look at this awesome post for more clarity

Since I had some problems combining the information from the different sources, I created the graphic below, which shows a combination of the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/). I think the graphics there are very helpful, but there is an error in how number_units is explained.
Several LSTM cells form one LSTM layer, as shown in the figure below. Since you are mostly dealing with very extensive data, it is not possible to feed everything into the model in one piece. Therefore, the data is divided into small pieces as batches, which are processed one after the other until the batch containing the last part is read in. In the lower part of the figure you can see the input (dark grey), where the batches are read in one after the other from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the described cells of the LSTM model (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps.

For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 batches (batch_size) and have a sequence of length 50 per batch (the number of time_steps and thus of LSTM cells). If you then encoded each character one-hot, each element (dark grey boxes of the input) would represent a vector with the length of the vocabulary (number of features). These vectors flow into the neural networks (green elements in the cells) in the respective cells and change their dimension to the number of hidden units (number_units).

So the input has the dimension (batch_size x time_step x features). The long-term memory (cell state) and short-term memory (hidden state) have the same dimensions (batch_size x number_units). The light grey blocks that arise from the cells have a different dimension, because the transformations in the neural networks (green elements) took place with the help of the hidden units: (batch_size x time_step x number_units). The output can be returned from any cell, but mostly only the information from the last block (black border) is relevant (not in all problems), because it contains all information from the previous time steps.
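To make those shapes concrete, here is a small Keras sketch following the 150-character example above (the vocabulary size of 65 and all other numbers are purely illustrative assumptions):
from keras.layers import Input, LSTM

batch_size, time_step, features, number_units = 3, 50, 65, 100   # 65 = assumed vocabulary size

x = Input(batch_shape=(batch_size, time_step, features))
outputs, h, c = LSTM(number_units, return_sequences=True, return_state=True)(x)

# outputs: (3, 50, 100) -> the light grey blocks, one per time step
# h, c:    (3, 100)     -> hidden state (short-term) and cell state (long-term memory)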

I think the term "num_hidden" is confusing for TF users. Actually, it has nothing to do with the unrolled LSTM cells; it is just the dimension of the tensor that the time-step input tensor is transformed into before being fed into the LSTM cell.

This term num_units or num_hidden_units, sometimes noted using the variable name nhid in implementations, means that the input to the LSTM cell is a vector of dimension nhid (or, for a batched implementation, it would be a matrix of shape batch_size x nhid). As a result, the output (from the LSTM cell) would also be of the same dimensionality, since the RNN/LSTM/GRU cell doesn't alter the dimensionality of the input vector or matrix.
As pointed out earlier, this term was borrowed from the Feed-Forward Neural Network (FFN) literature and has caused confusion when used in the context of RNNs. But the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed contain num_hidden units, as depicted in this figure:
Source: Understanding LSTM
More concretely, in the example below, num_hidden_units or nhid would be 3, since the hidden state (middle layer) is a 3-dimensional vector.

I think this correctly answers your question. LSTMs always cause confusion.
You can refer to this blog for more detail: Animated RNN, LSTM and GRU.

Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells. Hence, the confusion.
Each hidden layer has hidden cells, as many as the number of time steps.
And further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in RNN is (number of time steps, number of hidden units).

The concept of a hidden unit is illustrated in this image: https://imgur.com/Fjx4Zuo.

Following SangLe's answer, I made a picture (see sources for original pictures) showing cells as classically represented in tutorials (Source 1: Colah's Blog) and an equivalent cell with 2 units (Source 2: Raimi Karim's post). I hope it clarifies the confusion between cells/units and what the network architecture really is.

Related

Large trainable embedding layer slows down training

I am training a network to classify text with a LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam Optimizer and the words are fed into the network with a one-hot-encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't the training only update the weights which were used during the prediction of the current data point? Thus, if my input sequence always has the same length, the same number of updates should happen, regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall clock time, then it's an indication that simply performing the embedding lookup (and writing the update of the embedding table) now takes a significant part of your training time.
Which could easily be the case: 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You're also not saying how you're training this (CPU, GPU, which GPU), which has a meaningful impact on how this should turn out because of e.g. cache effects; on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have that would have enough examples of the #2000000th rarest word to calculate a meaningful embedding directly? It's probably possible, but it would be unusual: in pretty much all datasets, including very large ones, the #2000000th word would be a nonce, and thus it would be harmful to include it in trainable embeddings. The usual scenario would be to calculate large embeddings separately from large unlabeled data and use that as a fixed untrainable layer, and possibly concatenate them with small trainable embeddings from labeled data to capture things like domain-specific terminology.
If I understand correctly, your network takes one-hot vectors representing words to embeddings of some size embedding_size. Then the embeddings are fed as input to an LSTM. The trainable variables of the network are both those of the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell depends on the size of the embedding. If you look for example at the equation for the forget gate of the t-th cell,
you can see that the matrix of weights W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, so it takes longer to train.
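For reference (the equation itself is shown as an image in the original answer), the standard forget-gate equation from Colah's blog is

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where [h_{t-1}, x_t] is the concatenation of the previous hidden state (size n_hidden) and the current input (size embedding_size). Depending on the formulation, W_f either multiplies this concatenation jointly or multiplies x_t alone; in both cases one of its dimensions grows linearly with embedding_size, which is why a bigger embedding means a bigger LSTM.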

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality (timesteps, features), i.e. if you have 1000 training samples, each consisting of 100 timesteps with 10 values per timestep, your (per-sample) input shape will be (100, 10). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
from keras.layers import Input, LSTM   # Keras with the TensorFlow backend

myLongInput = Input(shape=(8192, 8,))              # 8192 timesteps, 8 values per step
myRecurrentFunction = LSTM(4)                      # returns only the final 4-dimensional output
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction = LSTM(4, return_sequences=True)   # also output the representation at every step
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
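Putting the pieces together, a minimal sketch of a stacked-LSTM classifier matching the shapes in your question might look like this (the layer sizes are illustrative assumptions, not recommendations):
from keras.layers import Input, LSTM, Dense
from keras.models import Model

inp = Input(shape=(8192, 8))
x = LSTM(32, return_sequences=True)(inp)    # intermediate representations at every step
x = LSTM(32)(x)                             # only the last representation
out = Dense(4, activation='softmax')(x)     # matches the one-hot Y of shape (num_samples, 4)

model = Model(inp, out)
model.compile(optimizer='adam', loss='categorical_crossentropy')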
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

TensorFlow's RNN units and cells

The constructor for many of the RNN classes (BasicRNNCell, LSTMCell, and so on) accepts an argument named num_units. This sets the number of units in the cell.
I thought this identified the number of elements the RNN should process in sequence. So if you want an RNN to process sequences of length N, you'd have N units per cell. Is this correct? What exactly is an RNN unit?
No, it's not correct.
num_units refers to the number of features your cells can represent. At each time step, you give an input of a certain size (that you are calling "the number of elements the RNN should process in sequence"). This is like the layer 0 of your neural network. This input is then processed into a hidden layer, with size num_units. This is also the size of the cell output.
What you call N is set by the size of your input tensor. num_units is a hyperparameter of your model. The bigger it is, the more degrees of freedom your model has (more descriptive features).
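A short sketch (assuming the TF 1.x API, with purely illustrative numbers) makes the distinction concrete: N comes from the input tensor, while num_units is a free choice:
import tensorflow as tf

N, input_dim, num_units = 20, 5, 64
inputs = tf.placeholder(tf.float32, [None, N, input_dim])

cell = tf.nn.rnn_cell.BasicRNNCell(num_units)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

print(outputs.shape)   # (?, 20, 64): still N = 20 steps, but each output is a 64-dim vector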
Here num_units refers to the number of units in an LSTM (or RNN) cell.
num_units can be interpreted as the analogue of the hidden layer of a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward neural network is equivalent to the number of LSTM units (num_units) in an LSTM cell at every time step of the network. The following picture should clear any confusion:
(cited from https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/)

Understanding Seq2Seq model

Here is my understanding of a basic Sequence to Sequence LSTMs. Suppose we are tackling a question-answer setting.
You have two sets of LSTMs (green and blue below). Each set shares weights within itself (i.e. each of the 4 green cells has the same weights, and similarly for the blue cells). The first is a many-to-one LSTM, which summarises the question at the last hidden layer / cell memory.
The second set (blue) is a Many to Many LSTM which has different weights to the first set of LSTMs. The input is simply the answer sentence while the output is the same sentence shifted by one.
The question is two fold:
1. Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or is it the last hidden state and the cell memory?
2. Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
(image taken from suriyadeepan.github.io)
Are we passing only the last hidden state to the blue LSTMs as the initial hidden state, or is it the last hidden state and the cell memory?
Both hidden state h and cell memory c are passed to the decoder.
TensorFlow
In seq2seq source code, you can find the following code in basic_rnn_seq2seq():
_, enc_state = rnn.static_rnn(enc_cell, encoder_inputs, dtype=dtype)
return rnn_decoder(decoder_inputs, enc_state, cell)
If you use an LSTMCell, the returned enc_state from the encoder will be a tuple (c, h). As you can see, the tuple is passed directly to the decoder.
Keras
In Keras, the "state" defined for an LSTMCell is also a tuple (h, c) (note that the order is different from TF). In LSTMCell.call(), you can find:
h_tm1 = states[0]
c_tm1 = states[1]
To get the states returned from an LSTM layer, you can specify return_state=True. The returned value is a tuple (o, h, c). The tensor o is the output of this layer, which will be equal to h unless you specify return_sequences=True.
Is there a way to set the initial hidden state and cell memory in Keras or TensorFlow? If so, is there a reference?
TensorFlow
Just provide the initial state to an LSTMCell when calling it. For example, in the official RNN tutorial:
lstm = tf.contrib.rnn.BasicLSTMCell(lstm_size)
...
output, state = lstm(current_batch_of_words, state)
There's also an initial_state argument for functions such as tf.nn.static_rnn. If you use the seq2seq module, provide the states to rnn_decoder as shown in the code for question 1.
Keras
Use the keyword argument initial_state in the LSTM function call.
out = LSTM(32)(input_tensor, initial_state=(h, c))
You can actually find this usage on the official documentation:
Note on specifying the initial state of RNNs
You can specify the initial state of RNN layers symbolically by calling them with the keyword argument initial_state. The value of initial_state should be a tensor or list of tensors representing the initial state of the RNN layer.
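Putting both answers together, here is a minimal Keras sketch (layer sizes and feature dimensions are illustrative assumptions) of passing the encoder's (h, c) to the decoder as its initial state:
from keras.layers import Input, LSTM, Dense
from keras.models import Model

encoder_inputs = Input(shape=(None, 70))                   # e.g. one-hot source tokens
_, h, c = LSTM(256, return_state=True)(encoder_inputs)     # keep only the final states

decoder_inputs = Input(shape=(None, 70))                   # target sequence shifted by one
decoder_outputs = LSTM(256, return_sequences=True)(decoder_inputs, initial_state=[h, c])
predictions = Dense(70, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], predictions)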
EDIT:
There's now an example script in Keras (lstm_seq2seq.py) showing how to implement basic seq2seq in Keras. How to make predictions after training a seq2seq model is also covered in this script.
(Edit: this answer is incomplete and hasn't considered actual possibilities of state transferring. See the accepted answer.)
From a Keras point of view, that picture has only two layers.
The green group is one LSTM layer.
The blue group is another LSTM layer.
There isn't any communication between green and blue other than passing the outputs. So, the answer for 1 is:
Only the thought vector (which is the actual output of the layer) is passed to the other layer.
Memory and state (not sure if these are two different entities) are totally contained inside a single layer and are not initially intended to be seen or shared with any other layer.
Each individual block in that image is totally invisible in keras. They are considered "time steps", something that only appears in the shape of the input data. It's rarely important to worry about them (unless for very advanced usages).
In keras, it's like this:
Easily, you have access only to the external arrows (including "thought vector").
But having access to each step (each individual green block in your picture) is not an exposed thing. So...
Passing the states from one layer to the other is also not expected in Keras. You will probably have to hack things. (See this: https://github.com/fchollet/keras/issues/2995)
But considering a thought vector big enough, you could say it will learn a way to carry what is important in itself.
The only notion you have from the steps is:
You have to input things shaped like (sentences, length, wordIdFeatures)
The steps will be performed considering that each slice in the length dimension is an input to each green block.
You may choose to have a single output (sentences, cells), for which you completely lose track of steps. Or...
Outputs like (sentences, length, cells), from which you know the output of each block through the length dimension.
One to many or many to many?
Now, the first layer is many to one (but nothing prevents it from being many to many too if you want).
But the second... that's complicated.
If the thought vector was made by a many-to-one layer, you will have to manage a way of creating a one-to-many. (That's not trivial in Keras, but you could think of repeating the thought vector for the expected length, making it the input to all steps, as sketched at the end of this answer. Or maybe fill an entire sequence with zeros or ones, keeping only the first element as the thought vector.)
If the thought vector was made by a many to many, you can take advantage of this and keep an easy many to many, if you're willing to accept that the output has exactly the same number of steps as the input.
Keras doesn't have a ready solution for 1 to many cases. (From a single input predict a whole sequence).
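For completeness, here is a sketch of the "repeat the thought vector" workaround mentioned above (all sizes and lengths are illustrative assumptions):
from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

question = Input(shape=(20, 100))        # 20 steps, 100 features per step
thought = LSTM(64)(question)             # many to one: a single 64-dim thought vector

repeated = RepeatVector(30)(thought)     # feed the same vector to 30 decoder steps
answer = LSTM(64, return_sequences=True)(repeated)
answer = TimeDistributed(Dense(100, activation='softmax'))(answer)

model = Model(question, answer)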

What does size of the GRU or LSTM cell in the TensorFlow seq2seq tutorial represent?

I'm working with the seq2seq model in the TensorFlow tutorials, and I'm having trouble understanding some of the details. One thing that is confusing to me is what the "size" of a cell represents. I think I have a high level understanding of images like
I believe this is showing that the output from the last step in the encoder is the input to the first step in the decoder. In this case each box is the GRU or LSTM cell at a different time-step in the sequence.
I also think I understand, at a superficial level, diagrams like this:
from colah's blog post about LSTM and GRU cells. My understanding is that a "cell" is a neural network that feeds the output from one step back into itself along with the new input for the subsequent step. The gates control how much it "remembers" and "forgets."
I think I am getting confused at the level between this superficial, high level understanding and the low-level details. It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct? If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size? Is that what the embedding wrapper does?
Most importantly, what effect would increasing or decreasing the size of the cell have? Would a larger cell be better at learning embeddings? Or at predicting outputs? Both?
It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct?
The size of the cell is the size of the RNN state vector h. In the case of LSTM it's also the size of c. It's not "the number of nodes" (I'm not sure what you mean by nodes).
If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size?
The input size for the model is independent of the state size. The two vectors (input and state) are concatenated and multiplied by a matrix of shape [state_size + input_size, state_size] to get the next state (simplified version).
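Written out for a simplified (vanilla RNN) cell, that update is

h_{t+1} = tanh([h_t, x_t] W + b)

where [h_t, x_t] is the concatenation of the state (size state_size) and the input (size input_size), and W has shape [state_size + input_size, state_size]. LSTM and GRU cells add gates, but the same shape argument applies.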
Is that what the embedding wrapper does?
No, the embedding is the result of multiplying the one-hot input vector with a matrix of size [vocab_size, input_size], before doing the multiplication described above.
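In numpy terms, the embedding lookup amounts to the following (vocab_size follows the tutorial default; the input_size of 1024 is only an assumption for illustration):
import numpy as np

vocab_size, input_size = 40000, 1024
E = np.random.randn(vocab_size, input_size).astype(np.float32)   # embedding matrix

word_id = 7
one_hot = np.zeros(vocab_size, dtype=np.float32)
one_hot[word_id] = 1.0

# multiplying the one-hot vector by E just selects row 7 of E
assert np.allclose(one_hot @ E, E[word_id])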