LSTM input in Keras - tensorflow

I am confused about the input vector for an LSTM model. My data is text, e.g. 1,000 sentences. I have two questions about the LSTM input layer:
1. If I tokenize those sentences into vectors (call them sentence vectors), is there a way in Keras to build sentence vectors from a document? It should be done at the word level, right?
2. The second question is about the 3D tensor that an LSTM expects. I have 1,000 sentences (samples), and time_step would be 1 if I want the LSTM to read one document at each time step, is that correct? The last dimension is the input dimension: is it the word dimension (100) in each sentence, or the number of words observed at each time step (10)?
Thus, should the LSTM tensor be (1000, 1, 10) or (1000, 1, 100)?

I don't know the answer to the first question. I have no experience working with Keras. However, I have some experience working with TensorFlow.
For the second question, did you mean "I want the LSTM to read one sentence at each time step"? If your intention is to work with sequences of sentences, which I believe is what you want since you are using an LSTM, then you need to define the sequence length (the number of sentences processed in one sequence). The sequence length is set by time_step, so time_step = 1 would be incorrect. If you are working with sentence vectors, then the last dimension is the size of the sentence embedding vector.
For example, if you have 1000 sentences in a document, each sentence is represented by a vector of size 100, and the sequence length is 5 (you want to process at most 5 sentences in one sequence), then the tensor dimensions would be (None, 5, 100). The first dimension is None so that you can decide later how many sequences to feed into the network at once (minibatch learning).

The input format is (nb_samples, time_steps, input_dim). In your case, the number of samples is the number of sentences. time_steps is the number of words in a sentence, so it is set to the length of the longest sentence (shorter sentences have to be padded to match this length). input_dim is the number of features used to represent each word. For example, if you use a word2vec embedding with, say, 100 or 200 dimensions (features), that would be your input_dim.
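To make the (nb_samples, time_steps, input_dim) layout concrete, here is a minimal pure-Python sketch. The whitespace tokenizer, the `<pad>` entry at index 0 and the `embed` function are all assumptions made for illustration (a real setup would use a proper tokenizer and trained word2vec vectors), and input_dim is shrunk to 4 so the result stays readable.

```python
# Toy data: in practice these would be the 1,000 sentences.
sentences = ["the cat sat", "a dog barked loudly today", "hello"]

# Word-level tokenization (a plain whitespace split is an assumption).
tokenized = [s.split() for s in sentences]

# Build a vocabulary; index 0 is reserved for padding.
vocab = {"<pad>": 0}
for words in tokenized:
    for w in words:
        vocab.setdefault(w, len(vocab))

time_steps = max(len(words) for words in tokenized)  # longest sentence
input_dim = 4  # stand-in for a 100-dimensional word2vec vector


def embed(index, dim):
    """Hypothetical embedding: a fake, deterministic vector per word index."""
    return [float(index)] * dim


def sentence_to_matrix(words):
    """Turn one sentence into a (time_steps, input_dim) list of lists."""
    ids = [vocab[w] for w in words]
    ids += [0] * (time_steps - len(ids))  # pad on the right with index 0
    return [embed(i, input_dim) for i in ids]


batch = [sentence_to_matrix(words) for words in tokenized]
shape = (len(batch), len(batch[0]), len(batch[0][0]))
print(shape)  # (3, 5, 4) -> (nb_samples, time_steps, input_dim)
```

With 1,000 sentences, a longest sentence of 10 words and 100-dimensional word vectors, the same procedure would give (1000, 10, 100).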

Related

How does Keras produce output of different size y, given an input of size x?

I am new to neural networks. I have been reading a lot of guides and tutorials that start with an LSTM layer whose input size differs from its output size,
e.g. model.add(LSTM(100, input_shape=(20, 1)))
followed by
model.add(Dense(80, activation='relu')), etc.
Presumably the output of the LSTM here has size 100, while the input has only 20.
For a dense layer I can imagine how that works, because there are plenty of diagrams depicting it, but how can an LSTM produce an output of a very different size from its input?
Also, importantly, what range of output sizes makes sense for a given input (say, of size 20)? Would any value make sense?
The output size can be anything. For example, when feeding word embeddings of length 256 and asking for an output of length 1000, it roughly follows these steps:
1. The embedding goes into the LSTM (I am ignoring the batch and sequence length here; just one word embedding in one time step).
2. The weight matrices (Waa, Way, Wax, etc.) are initialized. Their shapes depend on the output size you gave (e.g. 100 in the question's code).
3. All the needed calculations are carried out as per the LSTM equations.
4. An output vector of length 1000 is generated.
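To show why the output size is a free choice, the sketch below writes out the weight shapes of a standard LSTM layer for LSTM(100, input_shape=(20, 1)) from the question. The helper names are made up for illustration. The layout (an input kernel, a recurrent kernel and a bias, each covering four gates) is the usual LSTM formulation, and it is different from the Waa/Wax naming used above for a plain RNN.

```python
def lstm_weight_shapes(input_dim, units):
    """Shapes of the three weight arrays in a standard LSTM layer.

    The factor 4 covers the input, forget, cell-candidate and output gates.
    """
    return {
        "kernel": (input_dim, 4 * units),        # applied to the input x_t
        "recurrent_kernel": (units, 4 * units),  # applied to the previous output h_{t-1}
        "bias": (4 * units,),
    }


def count_params(shapes):
    """Total number of trainable numbers across all weight arrays."""
    total = 0
    for shape in shapes.values():
        n = 1
        for d in shape:
            n *= d
        total += n
    return total


# 20 time steps with 1 feature each, 100 output units.
shapes = lstm_weight_shapes(input_dim=1, units=100)
print(shapes["kernel"])            # (1, 400)
print(shapes["recurrent_kernel"])  # (100, 400)
print(count_params(shapes))        # 40800
```

Note that the number of time steps (20) does not appear anywhere: the same weights are reused at every step, and only the per-step feature count (1) and the chosen number of units (100) set the shapes.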

How to use Transformers for text classification?

I have two questions about how to use the TensorFlow implementation of the Transformer for text classification.
First, it seems people mostly use only the encoder to do the text classification task. However, the encoder generates one prediction for each input word. Based on my understanding of transformers, the input to the encoder each time is one word from the input sentence. Then the attention weights and the output are calculated using the current input word, and we can repeat this process for all of the words in the input sentence. As a result we end up with pairs of (attention weights, outputs) for each word in the input sentence. Is that correct? Then how would you use these pairs to perform text classification?
Second, based on the TensorFlow implementation of the transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
1. Just average the states you get from the encoder;
2. Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. When pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In the downstream tasks, it is also used for sentence classification. However, my experience is that sometimes averaging the hidden states gives a better result.
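The two approaches differ only in how the per-token encoder outputs are reduced to one vector. A minimal sketch, assuming the encoder output is given as a plain list of per-token vectors and that the [CLS] token sits at position 0 (the function names and the numbers are made up for illustration):

```python
def mean_pool(hidden_states):
    """Average the per-token vectors into a single sentence vector."""
    n = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(vec[j] for vec in hidden_states) / n for j in range(dim)]


def cls_pool(hidden_states):
    """Use the vector at position 0, where the special [CLS] token was prepended."""
    return hidden_states[0]


# Hypothetical encoder output for "[CLS] good movie": 3 tokens, hidden size 2.
encoder_output = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

print(mean_pool(encoder_output))  # [3.0, 2.0]
print(cls_pool(encoder_output))   # [1.0, 0.0]
```

Either result is a single fixed-size vector, which can then be fed to an ordinary classifier layer.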
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
Transformers are designed to take the whole input sentence at once. The main motive for designing the transformer was to enable parallel processing of the words in a sentence. This parallel processing is not possible in LSTMs, RNNs or GRUs, as they take the words of the input sentence one by one.
So in the encoder part of the transformer, the very first layer has one position per word in the sentence, and each position converts its word into the corresponding embedding vector. The rest of the processing is carried out from there. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this transformer for text classification: since in text classification the output is a single number, not a sequence of numbers or vectors, we can remove the decoder part and use only the encoder part. The output of the encoder is a set of vectors, one for each word in the input sentence. We can then feed this set of output vectors into a CNN, or add an LSTM or RNN model on top, and perform classification.
The input is the whole sentence or a batch of sentences, not word by word. You have probably misunderstood that part.

Understanding input/output dimensions of neural networks

Let's take a fully-connected neural network with one hidden layer as an example. The input layer consists of 5 units that are each connected to all hidden neurons. In total there are 10 hidden neurons.
Libraries such as Theano and Tensorflow allow multidimensional input/output shapes. For example, we could use sentences of 5 words where each word is represented by a 300d vector.
How is such an input mapped onto the described neural network? I do not understand what an output shape of (None, 5, 300) (just an example) means. In my imagination we just have a bunch of neurons through which single numbers flow.
When I have an output shape of (None, 5, 300), how many neurons do I have in the corresponding network? How do I connect the words to my neural network?
Yes, we just have a bunch of neurons through which single numbers flow.
But: if you must give your network 5 numbers as input, it's then convenient to give these numbers in an array with length 5.
And if you're giving 30 thousand examples for your network to train, then it's convenient to create an array with 30 thousand elements, each element being an array of 5 numbers.
In the end, this input with 30 thousand examples of 5 numbers is an array with shape (30000,5).
Each layer then has its own output shape. Each layer's output is certainly related to its own number of neurons. Each neuron will throw out a number (or sometimes an array, depending on which layer type you're using). But 10 neurons together will throw out 10 numbers, which will then be packed in an array shaped (30000,10).
The word "None" in those shapes is related to the batch size (the amount of examples you give for training or predicting). You don't define that number, it is automatically understood when you pass a batch.
Looking at your network:
When you have an input of 5 units, you get an input shape of (None,5). But you actually pass only (5,) to your model, because the None part is the batch size, which is only known once you actually pass data for training or predicting.
This number means: you have to give your network an array with a number of samples, each sample being an array of 5 numbers.
Then, your hidden layer with 10 neurons will calculate and give you 10 numbers as output, in an array shaped as (None, 10).
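The shape bookkeeping above can be checked with a small pure-Python sketch of a fully connected layer. The weights are made-up constants, the activation function is left out, and the batch has 3 samples instead of 30000, so this only illustrates the shapes, not a trained network.

```python
def dense_forward(batch, weights, biases):
    """Apply a fully connected layer: one output number per neuron per sample."""
    outputs = []
    for sample in batch:
        row = []
        for neuron, bias in zip(weights, biases):
            row.append(sum(x * w for x, w in zip(sample, neuron)) + bias)
        outputs.append(row)
    return outputs


n_inputs, n_hidden = 5, 10
# Hypothetical weights: every neuron holds one weight per input unit.
weights = [[0.1] * n_inputs for _ in range(n_hidden)]
biases = [0.0] * n_hidden

batch = [[1.0, 2.0, 3.0, 4.0, 5.0]] * 3  # 3 samples, each an array of 5 numbers
hidden = dense_forward(batch, weights, biases)

print((len(batch), len(batch[0])))    # (3, 5)  -> input shape (None, 5)
print((len(hidden), len(hidden[0])))  # (3, 10) -> output shape (None, 10)
```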
What is a (None,5,300)?
If you're saying that each word is a 300d vector, there are a few different ways to translate a word into that.
One of the common ways starts from a question: how many words do you have in your dictionary?
If you have a dictionary with 300 words, you can make each word a vector with 300 elements, all zeros except for one of them.
Say the word "hello" is the first word in your dictionary; its vector will be [1,0,0,0, ...., 0]
Say the word "my" is the second word in your dictionary; its vector will be [0,1,0,0, ...., 0]
And the word "fly" is the last one in the dictionary; its vector will be [0,0,0,0, ...., 1]
You do this for your entire dictionary, and whenever you have to pass the word "hello" to your network, you will pass [1,0,0,0, ...., 0] instead.
A sentence with five words will then be an array with five of these arrays. This means a sentence with five words will be shaped as (5, 300). If you pass 30 thousand sentences as examples: (30000, 5, 300). In the model, "None" appears as the batch size: (None, 5, 300).
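The one-hot scheme just described can be sketched in a few lines. The dictionary here has 5 words instead of 300, and the extra words ("name", "is") are invented for the example, so a five-word sentence comes out as (5, 5) rather than (5, 300).

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec


dictionary = ["hello", "my", "name", "is", "fly"]  # 5 words instead of 300
word_to_index = {w: i for i, w in enumerate(dictionary)}


def encode(sentence):
    """Replace every word of the sentence with its one-hot vector."""
    return [one_hot(word_to_index[w], len(dictionary)) for w in sentence.split()]


encoded = encode("hello my name is fly")
print(encoded[0])                        # [1, 0, 0, 0, 0]
print(encoded[-1])                       # [0, 0, 0, 0, 1]
print((len(encoded), len(encoded[0])))   # (5, 5): (words in sentence, dictionary size)
```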
There are also other options, such as creating a word Embedding, which will translate the words into vectors of meanings. Meanings which only the network will understand. (There is the Embedding layer on Keras for that).
There are also things called CBOW (continuous bag of words).
You have to know what you want to do first, so you can translate your words in some array that fits the network's requirements.
How many neurons do I have for an output of (None,5,300)?
This only tells you about the last layer. The other layers' outputs were all calculated and packed together by the following layers, which changed the output. Each layer has its own output. (When you have a model, you can do a model.summary() and see the output of each layer.)
Even so, it's impossible to answer that question without knowing which types of layers you're using.
There are layers such as Dense that throw out things like (BatchSize,NumberOfNeurons)
But there are layers such as Convolution2D that throw out things like (BatchSize, numberOfChannels, pixelsInX, pixelsInY). For instance, a regular image has three channels: red, blue and green. An array for passing a regular image would be like (3,sizeX,sizeY).
It all depends on which layer type you're using.
Using a word embedding
For using an embedding, it's worth reading the Keras documentation about it.
For that you will have to transform your words into indices.
Instead of saying that each word in your dictionary is a vector, you say it's a number.
Word "hello" is 1
Word "my" is 2
Word "fly" is theSizeOfYourDictionary
If you want each sentence to have 100 words, then your input shape will be (None, 100), where each array of 100 numbers contains the indices of the words in your dictionary.
The first layer in your model will be an Embedding layer.
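from keras.models import Sequential
from keras.layers import Embedding
model = Sequential()
model.add(Embedding(theSizeOfYourDictionary + 1, 300, input_length=100))  # +1 because indices run from 1 to theSizeOfYourDictionary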
This way, you're creating vectors of size 300 for each word, passing sequences of 100 words. (I'm not used to embeddings, but it seems 300 is a big number, it could be less).
The output of this embedding will be (None, 100, 300).
Then you connect other layers after it.
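Conceptually, an embedding layer is a lookup table with one row per word index. The sketch below is not the Keras implementation, just a pure-Python illustration with small, made-up sizes (6 words, vectors of size 3, sentences of 4 words) and random values where a trained layer would have learned ones.

```python
import random

vocab_size = 6      # words are numbered 1..6; index 0 is left for padding
embedding_dim = 3   # 300 in the answer above
sentence_len = 4    # 100 in the answer above

random.seed(0)
# One row per index 0..vocab_size, hence vocab_size + 1 rows.
table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
         for _ in range(vocab_size + 1)]


def embed_batch(batch_of_indices):
    """Replace every word index with its row of the table."""
    return [[table[i] for i in sentence] for sentence in batch_of_indices]


batch = [[1, 2, 6, 0], [3, 3, 4, 5]]  # shape (2, 4): 2 sentences of 4 indices
out = embed_batch(batch)
print((len(out), len(out[0]), len(out[0][0])))  # (2, 4, 3)
```

With the sizes from the answer, the same lookup turns an input of shape (None, 100) into an output of shape (None, 100, 300). The same word always maps to the same vector, wherever it appears.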

Dynamic LSTM model in Tensorflow

I am looking to design an LSTM model using Tensorflow, wherein the sentences are of different lengths. I came across a tutorial on the PTB dataset (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py). How does this model handle instances of varying length? The example does not discuss padding or any other technique for handling variable-size sequences.
If I use padding, what should be the unrolling dimension?
You can do this in two ways.
1. TF has a way to specify the input size. Look for a parameter called "sequence_length"; I have used this in tf.nn.bidirectional_rnn. TF will then unroll your cell only up to sequence_length, not up to the full step size.
2. Pad your input with a predefined dummy input and a predefined dummy output (for the dummy input). The LSTM cell will learn to predict the dummy output for the dummy input. When using the result (say for matrix calculation), chop off the dummy parts.
The PTB model is truncated in time -- it always back-propagates a fixed number of steps (num_steps in the configs). So there is no padding -- it just reads the data and tries to predict the next word, and always reads num_steps words at a time.
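A rough sketch of what "always reads num_steps words at a time" means is given below. It assumes the corpus has already been flattened into one long list of word ids, and it ignores the batching that the actual tutorial reader also does. The function name is made up for illustration.

```python
def ptb_windows(word_ids, num_steps):
    """Yield (inputs, targets) pairs of fixed length num_steps.

    Targets are the inputs shifted by one word (next-word prediction).
    """
    n_windows = (len(word_ids) - 1) // num_steps
    for k in range(n_windows):
        start = k * num_steps
        x = word_ids[start:start + num_steps]
        y = word_ids[start + 1:start + num_steps + 1]
        yield x, y


corpus = list(range(10))  # stand-in for the whole corpus as word ids
windows = list(ptb_windows(corpus, num_steps=3))
for x, y in windows:
    print(x, y)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]
```

Because every window has exactly num_steps words regardless of where sentences begin or end, no padding is needed, and num_steps is the unrolling dimension.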

Tensorflow unrolled LSTM longer than input sequence

I want to create an LSTM in tensorflow to predict time-series data. My training data is a set of input/output sequences of different lengths. Can I include multiple sequences of different lengths in the same training batch? Or do I need to pad them to equal lengths? If so, how?
Also: What will tensorflow do if the unrolled RNN is longer than the input sequence? The rnn() method contains an optional sequence_length argument which appears designed to handle this eventuality, but I'm not clear what it does.
Do you want to build the model from scratch? Otherwise you might want to look into the translate.py-model. Here your issue is taken care of by:
- padding the input (and output) sequences with a PAD-symbol (basically a neutral "no info"-symbol)
- buckets: for different groups of lengths you can create different buckets (this makes sense only if your sequence lengths vary a lot from shortest to longest)
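The padding and bucketing idea can be sketched as follows. This is not the translate.py code: the bucket sizes, the PAD id of 0 and the function names are assumptions chosen for illustration.

```python
def assign_bucket(length, bucket_sizes):
    """Return the smallest bucket that fits, or None if the sequence is too long."""
    for size in bucket_sizes:
        if length <= size:
            return size
    return None


def bucket_and_pad(sequences, bucket_sizes, pad=0):
    """Group sequences by bucket and pad each one to its bucket size."""
    buckets = {size: [] for size in bucket_sizes}
    for seq in sequences:
        size = assign_bucket(len(seq), bucket_sizes)
        if size is None:
            continue  # too long for any bucket; skipped in this sketch
        buckets[size].append(seq + [pad] * (size - len(seq)))
    return buckets


sequences = [[1, 2], [3, 4, 5, 6, 7], [8], [9, 10, 11]]
buckets = bucket_and_pad(sequences, bucket_sizes=[3, 6])
print(buckets[3])  # [[1, 2, 0], [8, 0, 0], [9, 10, 11]]
print(buckets[6])  # [[3, 4, 5, 6, 7, 0]]
```

Each bucket can then be fed as its own batch, so short sequences are not padded all the way up to the length of the longest one in the whole dataset.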
You don't have to put only input/output sequences of the same length into a batch. TF has a way to specify the input size. The parameter "sequence_length" controls the number of time steps a cell is unrolled for, so TF will unroll your cell only up to sequence_length, not up to the full step size.
So while feeding the inputs and outputs, also feed a sequence_length array which contains the length of each input:
tf.nn.bidirectional_rnn(fwd_stacked_lstm_cells, bwd_stacked_lstm_cells,
                        reshaped_inputs,
                        sequence_length=sequence_length)
.....
feed_dict={
    model.inputs: x,
    model.targets: y,
    model.sequence_length: lengths})
where
len(lengths) == batch_size and
for all i, lengths[i] == length of input x[i] (same as length of output y[i])
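As for what sequence_length does when the unrolled RNN is longer than a sequence: as far as I understand the TensorFlow RNN functions, past a sequence's length the outputs are zeroed and the last valid state is carried through unchanged. The toy loop below imitates that behaviour in pure Python. The `run_rnn` helper and the running-sum "cell" are made up for illustration and are not TensorFlow code.

```python
def run_rnn(inputs, lengths, step, initial_state=0.0):
    """Run a toy recurrent cell over a padded batch.

    inputs:  list of sequences, all padded to the same number of steps
    lengths: true length of each sequence, len(lengths) == batch size
    step:    function (state, x) -> new_state
    Past a sequence's true length the output is zero and the state is kept as is.
    """
    outputs, final_states = [], []
    for seq, length in zip(inputs, lengths):
        state, seq_out = initial_state, []
        for t, x in enumerate(seq):
            if t < length:
                state = step(state, x)
                seq_out.append(state)
            else:
                seq_out.append(0.0)  # padded step: no update, zero output
        outputs.append(seq_out)
        final_states.append(state)
    return outputs, final_states


x = [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]  # second row is padded to length 3
lengths = [3, 1]
outputs, final_states = run_rnn(x, lengths, step=lambda s, v: s + v)
print(outputs)       # [[1.0, 3.0, 6.0], [4.0, 0.0, 0.0]]
print(final_states)  # [6.0, 4.0]
```

The padded positions still have to exist in the input array so that the batch is rectangular, but they do not influence the final state of that sequence.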