Let's take a fully-connected neural network with one hidden layer as an example. The input layer consists of 5 units that are each connected to all hidden neurons. In total there are 10 hidden neurons.
Libraries such as Theano and Tensorflow allow multidimensional input/output shapes. For example, we could use sentences of 5 words where each word is represented by a 300d vector.
How is such an input mapped on the described neural network? I do not understand what an ouptut shape of (None, 5, 300) (just an example) means. In my imagination we just have a bunch of neurons through which single numbers flow.
When I have an output shape of (None, 5, 300), how much neurons do I have in the corresponding network? How do I connect the words to my neural network?
Yes, we just have a bunch of neurons throuhg which single numbers flow.
But: if you must give your network 5 numbers as input, it's then convenient to give these numbers in an array with length 5.
And if you're giving 30 thousand examples for your network to train, then it's convenient to create an array with 30 thousand elements, each element being an array of 5 numbers.
In the end, this input with 30 thousand examples of 5 numbers is an array with shape (30000,5).
Each layer then has it's own output shape. Each layer's output is certainly related to its own amount of neurons. Each neuron will throw out a number (or sometimes an array, depending on which layer type you're using). But 10 neurons together will throw out 10 numbers, which will then be packed in an array shaped (30000,10).
The word "None" in those shapes is related to the batch size (the amount of examples you give for training or predicting). You don't define that number, it is automatically understood when you pass a batch.
Looking at your network:
When you have an input of 5 units, you got an input shape of (None,5). But you actually say only (5,) to your model, because the None part is the batch size, which will only appear when training.
This number means: you have to give your network an array with a number of samples, each sample being an array of 5 numbers.
Then, your hidden layer with 10 neurons will calculate and give you 10 numbers as output, in an array shaped as (None, 10).
What is a (None,5,300)?
If you're saying that each word is a 300d vector, there are a few different ways to translate a word in that.
One of the common ways is: how many words you have in your dictionary?
If you have a dictionary with 300 words, you can then make each word be a vector with 300 elements, being all zeros, except for one of them.
Say word "hello" is the first word in your dictionary, it's vector will be [1,0,0,0, ...., 0]
Say word "my" is the second word in your dictionary, it's vector will be [0,1,0,0, ...., 0]
And the word "fly" is the last one in the dictionary, it's vector will be [0,0,0,0, ...., 1]
You do this for your entire dictionary, and whenever you have to pass the word "hello" to your network, you will pass [1,0,0,0 ..., 0] instead.
A sentence with five words will then be an array with five of these arrays. This means, a sentence with five words will be shaped as (5, 300). If you pass 30 thousand sentences as examples: (30000,5,300). In the model, "None" appears as the batch size (None, 5, 300)
There are also other options, such as creating a word Embedding, which will translate the words into vectors of meanings. Meanings which only the network will understand. (There is the Embedding layer on Keras for that).
There are also things called CBOW (continous bag of words).
You have to know what you want to do first, so you can translate your words in some array that fits the network's requirements.
How many neurons do I have for an output of (None,5,300)?
This only tells you about the last layer. The other layers' outputs were all calculated and packed together by the following layers, which changed the output. Each layer has its own output. (When you have a model, you can do a model.summary() and see the output of each layer.)
Even though, it's impossible to answer that question without knowing which types of layers you're using.
There are layers such as Dense that throw out things like (BatchSize,NumberOfNeurons)
But there are layers such as Convolution2D that throw out things like (BatchSize, numberOfChannels, pixelsInX, pixelsInY). For instance, a regular image has three channels: red, blue and green. An array for passing a regular image would be like (3,sizeX,sizeY).
It all depends on which layer type you're using.
Using a word embedding
For using an embedding, it's interesting to read keras documentation about it.
For that you will have to transform your words in indices.
Instead of saying that each word in your dictionary is a vector, you say it's a number.
Word "hello" is 1
Word "my" is 2
Word "fly" is theSizeOfYourDictionary
If you want each sentence to have 100 words, then your input shape will be (None, 100). Where each array of 100 numbers contains numbers representing the words in your dictionary.
The first layer in your model will be an Embedding layer.
model = Sequential()
model.add(Embedding(theSizeOfYourDictionary, 300, input_length=100)
This way, you're creating vectors of size 300 for each word, passing sequences of 100 words. (I'm not used to embeddings, but it seems 300 is a big number, it could be less).
The output of this embedding will be (None, 100, 300).
Then you connect other layers after it.
Related
I am new to neural network here. I am reading a lot of guides and tutorial where they will start with an lstm layer where the input size differs from the output size
eg. model.add(LSTM(100, input_shape=(20, 1))) ->
before doing ->
model.add(Dense(80, activation='relu')), etc.
presumably, the output layer for the lstm here has size 100, where the input has only 20
for a dense layer I can imagine how that works because there are plenty of graphs depicting that, but how can a lstm produce output layer of very different size from the input?
and also importantly, of what range of value can the output be given the input (let's say of 20) effectively be? would any value make sense?
The output size can be anything. For example, in case of feeding word embeddings of 256 length and output size 1000 length, it somewhat follows the below steps:
Embedding goes into the LSTM (Here, I am ignoring the batch and sequence length; just one word embedding in one time-step)
The Weight Matrix (Waa, Way, Wax etc are initialized) : These matrices shapes depends upon the output size you gave (e.g. 100 above)
All the needed calculations are followed as per LSTM semantics
The output of 1000 vector length is generated
I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality: (timesteps, shape), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100,10,). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
myLongInput = Input(shape=(8192,8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction=LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.
I am a newbie to Keras (and somehow to TF) but I have found shape definition for the input layer very confusing.
So in the examples, when we have a 1D vector of length 20 for input, shape gets defined as
...Input(shape=(20,)...)
And when a 2D tensor for greyscale images needs to be defined for MNIST, it is defined as:
...Input(shape=(28, 28, 1)...)
So my question is why the tensor is not defined as (20) and (28, 28)? Why in the first case a second dimension is added and left empty? Also in second, number of channels have to be defined?
I understand that it depends on the layer so Conv1D, Dense or Conv2D take different shapes but it seems the first parameter is implicit?
According to docs, Dense needs be (batch_size, ..., input_dim) but how is this related the example:
Dense(32, input_shape=(784,))
Thanks
Tuples vs numbers
input_shape must be a tuple, so only (20,) can satisfy it. The number 20 is not a tuple. -- There is the parameter input_dim, to make your life easier if you have only one dimension. This parameter can take 20. (But really, I find it just confusing, I always work with input_shape and use tuples, to keep a consistent understanding).
Dense(32, input_shape=(784,)) is the same as Dense(32, input_dim=784).
Images
Images don't have only pixels, they also have channels (red, green, blue).
A black/white image has only one channel.
So, (28pixels, 28pixels, 1channel)
But notice that there isn't any obligation to follow this shape for images everywhere. You can shape them the way you like. But some kinds of layers do demand a certain shape, otherwise they couldn't work.
Some layers demand specific shapes
It's the case of the 2D convolutional layers, which need (size1,size2,channels). They need this shape because they must apply the convolutional filters accordingly.
It's also the case of recurrent layers, which need (timeSteps,featuresPerStep) to perform their recurrent calculations.
MNIST models
Again, there isn't any obligation to shape your image in a specific way. You must do it according to which first layer you choose and what you intend to achieve. It's a free thing.
Many examples simply don't care about an image being a 2d structured thing, and they just use models that take 784 pixels. That's enough. They probably start with Dense layers, which demand shapes like (size,)
Other examples may care, and use a shape (28,28), but then these models will have to reshape the input to fit the needs of the next layer.
Convolutional layers 2D will demand (28,28,1).
The main idea is: input arrays must match input_shape or input_dim.
Tensor shapes
Be careful, though, when reading Keras error messages or working with custom / lambda layers.
All these shapes we defined before omit an important dimension: the batch size or the number of samples.
Internally all tensors will have this additional dimension as the first dimension. Keras will report it as None (a dimension that will adapt to any batch size you have).
So, input_shape=(784,) will be reported as (None,784).
And input_shape=(28,28,1) will be reported as (None,28,28,1)
And your actual input data must have a shape that matches that reported shape.
I am confused about the input vector in LSTM model, the data I am using is the text data, e.g. 1,000 sentences. I have two questions about the LSTM input layer:
1.If I would tokenize those sentences into the vectors (we can call it sentence vectors), is there a way in Keras to make sentence vectors given a document? Should be word level, right?
2.The second question is the 3D Tensor type in LSTM. I have 1,000 sentences (samples) and time_step would be 1 if I want to LSTM read one document at each time step, is that correct? The last one is the input dimension, this input dimension is the word dimension (100) in each sentence or how many word observed in each time step (10)?
Thus the LSTM tensor should be (1000, 1, 10) or (1000, 1, 100)
I don't know the answer to the first question. I have no experience working with Keras. However, I have some experience with working on Tensorflow.
For the second question, did you mean "I want to LSTM read one sentence at each time step". If your intention is to work with sentence sequence, I believe that is what you want to do as you are using LSTM, then you need to define the sequence length (number of sentences will be processed in one particular sequence). Sequence length can be defined by using time_step. So setting time_step = 1 would be incorrect. If you are working with sentence vectors then the last one is the vector dimension for sentence embedding.
For an example, if you have 1000 sentences in a document, each sentence is represented by a vector of size 100, and the sequence length is 5 (you want to process maximum 5 sentences in one sequence), then the tensor dimensions would be (None, 5,100). The first one is 'None' so that you can decide how many sequences you want to feed into the network later (minibatch learning).
The input format is (nb_samples, time_steps, input_dim). In your case, the number of samples is number of sentences. Time steps is the number of words in sentence, so this would be the number of words in the sentence that has maximum words (other sentences have to be padded to match this length). input_dimension is the number of features used to represent each word. For example if you use word2vec embedding say with 100 or 200 dimentions (features), that would be your input_dimension.
In MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary-layer formed when you represent an unrolled RNN over time?
Why is the num_units = 128 in most cases ?
From this brilliant article
num_units can be interpreted as the analogy of hidden layer from the feed forward neural network. The number of nodes in hidden layer of a feed forward neural network is equivalent to num_units number of LSTM units in a LSTM cell at every time step of the network.
See the image there too!
The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of it's own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
A good answer to a similar question here.
You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.
The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input with a matrix of shape [10, n_hidden], and get a n_hidden vector.
Your LSTM gets at each timestep t:
the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
the input, transformed to size n_hidden
it will sum these inputs and produce the next hidden state h_t of size n_hidden
From Colah's blog post:
If you just want to have code working, just keep with n_hidden = 128 and you will be fine.
An LSTM keeps two pieces of information as it propagates through time:
A hidden state; which is the memory the LSTM accumulates using its (forget, input, and output) gates through time, and
The previous time-step output.
Tensorflow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.
Look at this awesome post for more clarity
Since I had some problems to combine the information from the different sources I created the graphic below which shows a combination of the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/) where I think the graphics are very helpful but an error in explaining the number_units is present.
Several LSTM cells form one LSTM layer. This is shown in the figure below. Since you are mostly dealing with data that is very extensive, it is not possible to incorporate everything in one piece into the model. Therefore, data is divided into small pieces as batches, which are processed one after the other until the batch containing the last part is read in. In the lower part of the figure you can see the input (dark grey) where the batches are read in one after the other from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the described cells of the LSTM model (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps. For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 (batch_size) and have a sequence of length 50 per batch (number of time_steps and thus of LSTM cells). If you then encoded each character one-hot, each element (dark gray boxes of the input) would represent a vector that would have the length of the vocabulary (number of features). These vectors would flow into the neuronal networks (green elements in the cells) in the respective cells and would change their dimension to the length of the number of hidden units (number_units). So the input has the dimension (batch_size x time_step x features). The Long Time Memory (Cell State) and Short Time Memory (Hidden State) have the same dimensions (batch_size x number_units). The light gray blocks that arise from the cells have a different dimension because the transformations in the neural networks (green elements) took place with the help of the hidden units (batch_size x time_step x number_units). The output can be returned from any cell but mostly only the information from the last block (black border) is relevant (not in all problems) because it contains all information from the previous time steps.
I think it is confusing for TF users by the term "num_hidden". Actually it has nothing to do with the unrolled LSTM cells, and it just is the dimension of the tensor, which is transformed from the time-step input tensor to and fed into the LSTM cell.
This term num_units or num_hidden_units sometimes noted using the variable name nhid in the implementations, means that the input to the LSTM cell is a vector of dimension nhid (or for a batched implementation, it would a matrix of shape batch_size x nhid). As a result, the output (from LSTM cell) would also be of same dimensionality since RNN/LSTM/GRU cell doesn't alter the dimensionality of the input vector or matrix.
As pointed out earlier, this term was borrowed from Feed-Forward Neural Networks (FFNs) literature and has caused confusion when used in the context of RNNs. But, the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed be containing num_hidden units as depicted in this figure:
Source: Understanding LSTM
More concretely, in the below example the num_hidden_units or nhid would be 3 since the size of hidden state (middle layer) is a 3D vector.
I think this is a correctly answer for your question. LSTM always make confusion.
You can refer this blog for more detail Animated RNN, LSTM and GRU
Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells. Hence, the confusion.
Each hidden layer has hidden cells, as many as the number of time steps.
And further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in RNN is (number of time steps, number of hidden units).
The Concept of hidden unit is illustrated in this image https://imgur.com/Fjx4Zuo.
Following #SangLe answer, I made a picture (see sources for original pictures) showing cells as classically represented in tutorials (Source1: Colah's Blog) and an equivalent cell with 2 units (Source2: Raimi Karim 's post). Hope it will clarify confusion between cells/units and what really is the network architecture.