I am looking to design an LSTM model using TensorFlow, where the sentences are of different lengths. I came across a tutorial on the PTB dataset (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py). How does this model handle instances of varying length? The example does not discuss padding or any other technique for handling variable-length sequences.
If I use padding, what should be the unrolling dimension?
You can do this in two ways.
TF has a way to specify the input length. Look for a parameter called "sequence_length"; I have used it in tf.nn.bidirectional_rnn. TF will then unroll your cell only up to sequence_length for each example, rather than for the full number of steps.
Pad your input with a predefined dummy input and your targets with a predefined dummy output. The LSTM cell will learn to predict the dummy output for the dummy input. When using the results (say, for matrix calculations), chop off the dummy parts.
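To make the padding option concrete, here is a minimal pure-Python sketch (not TensorFlow code). The helper names pad_batch and strip_padding are hypothetical, and 0 is only assumed to be the dummy value. It pads a batch to the longest sequence, records the true lengths (the kind of vector you would pass as sequence_length), and chops the dummy parts off again afterwards.

```python
def pad_batch(sequences, pad_value=0):
    """Pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths


def strip_padding(rows, lengths):
    """Chop the dummy positions off each row, keeping only the real steps."""
    return [row[:n] for row, n in zip(rows, lengths)]


batch = [[4, 7, 9], [3, 5], [8]]
padded, lengths = pad_batch(batch)
print(padded)   # [[4, 7, 9], [3, 5, 0], [8, 0, 0]]
print(lengths)  # [3, 2, 1]
print(strip_padding(padded, lengths))  # [[4, 7, 9], [3, 5], [8]]
```

The lengths list is the piece both options share: with sequence_length it tells the RNN where to stop, and with dummy padding it tells you where to cut.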
The PTB model is truncated in time -- it always back-propagates a fixed number of steps (num_steps in the configs). So there is no padding -- it just reads the data and tries to predict the next word, and always reads num_steps words at a time.
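A rough sketch of what "always reads num_steps words at a time" amounts to, assuming the corpus is treated as one long stream of token ids. The function name ptb_windows is hypothetical, and batching is left out to keep it short, so this is not the tutorial's actual reader.

```python
def ptb_windows(token_ids, num_steps):
    """Yield (inputs, targets) windows of exactly num_steps tokens.

    The targets are the inputs shifted one position to the right,
    so the model is always asked to predict the next word.
    """
    # The last token has no successor, so it can only appear as a target.
    n_windows = (len(token_ids) - 1) // num_steps
    for i in range(n_windows):
        start = i * num_steps
        x = token_ids[start:start + num_steps]
        y = token_ids[start + 1:start + num_steps + 1]
        yield x, y


stream = list(range(10))  # stand-in for a stream of word ids
windows = list(ptb_windows(stream, num_steps=3))
for x, y in windows:
    print(x, y)
# [0, 1, 2] [1, 2, 3]
# [3, 4, 5] [4, 5, 6]
# [6, 7, 8] [7, 8, 9]
```

Every window has the same length, which is why no padding is needed in that setup.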
Related
I have two questions about how to use the TensorFlow implementation of the Transformer for text classification.
First, it seems people mostly use only the encoder layer for text classification. However, the encoder layer generates one prediction for each input word. Based on my understanding of transformers, the input to the encoder each time is one word from the input sentence. Then the attention weights and the output are calculated using the current input word, and we can repeat this process for all of the words in the input sentence. As a result, we end up with pairs of (attention weights, outputs) for each word in the input sentence. Is that correct? If so, how would you use these pairs to perform text classification?
Second, based on the TensorFlow implementation of the transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. During pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In downstream tasks, it is also used for sentence classification. However, my experience is that averaging the hidden states sometimes gives a better result.
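The two approaches can be sketched in plain Python, assuming the encoder has already produced one hidden vector per token and that padded positions sit at the end. The names mean_pool and cls_pool are hypothetical, and the numbers are made up for illustration.

```python
def mean_pool(hidden_states, length):
    """Average the first `length` vectors, ignoring padded positions."""
    valid = hidden_states[:length]
    dim = len(valid[0])
    return [sum(vec[d] for vec in valid) / length for d in range(dim)]


def cls_pool(hidden_states):
    """Use the hidden state of the special token prepended at position 0."""
    return hidden_states[0]


# Four positions, two dimensions; the last position is padding.
states = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [0.0, 0.0]]
mean_vec = mean_pool(states, length=3)
cls_vec = cls_pool(states)
print(mean_vec)  # [2.0, 2.0]
print(cls_vec)   # [1.0, 0.0]
```

Either vector has a fixed size regardless of sentence length, so it can be fed to an ordinary classifier layer.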
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
Transformers are designed to take the whole input sentence at once. The main motivation for designing the transformer was to enable parallel processing of the words in a sentence. This parallel processing is not possible in LSTMs, GRUs, or other RNNs, as they take the words of the input sentence one by one.
So in the encoder part of the transformer, the very first layer is an embedding layer that converts every word of the sentence into its embedding vector, all positions at once. The rest of the processing then follows. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this transformer for text classification: since the output in text classification is a single label rather than a sequence of numbers or vectors, we can remove the decoder and use only the encoder. The output of the encoder is a set of vectors, one per word in the input sentence. We can then feed these output vectors into a CNN, an LSTM, or another RNN and perform the classification.
The input is the whole sentence, or a batch of sentences, not one word at a time. You have probably misunderstood this point.
I implemented the CNN model for text classification based on this paper. Since the CNN can only deal with sentences of a fixed size, I set the input size to the maximum sentence length in my dataset and zero-padded the shorter sentences. But as I understand it, no matter how long the input sentence is, the max pooling strategy will always extract only one value per filter map. So it shouldn't matter whether the input sentence is long or short, because after the filters are convolved and pooled, the output will be the same size. In that case, why should I zero-pad all the short sentences to a fixed size?
For example, my code for feeding data into the CNN model is self.input_data = tf.placeholder(tf.int32,[None,max_len],name="input_data"). Can I leave max_len unspecified and use None instead, so that the size depends on the length of the current training sentence?
In addition, I was wondering whether there is any other newer approach that handles variable-length input for a CNN model. I also found another paper that addresses this problem, but as I understand it, it only uses k values for max pooling instead of 1 value. How does that deal with variable-length sentences?
Quick answer:
No, you can't.
Longer answer:
Pooling is like a reduce function: applying it to a layer reduces the dimensions. But different input shapes don't produce the same output shapes. With zero padding you can simulate this, and that is what padding to max_len does. In the second paper, the idea is to have a dynamic computational graph, which is not the same thing as before. It basically creates several networks of different depths (depending on their input size). The generalized version for encoder-decoder architectures is called ByteNet.
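To illustrate what "k values for max-pooling instead of 1 value" means in the question, here is a sketch of k-max pooling over the time axis of a single filter map. I am assuming that this is what the second paper does, since the paper is not named here. The function name k_max_pool is hypothetical, and the sketch covers only the pooling step, not the graph construction discussed above.

```python
def k_max_pool(feature_map, k):
    """Keep the k largest activations, preserving their original order.

    If the feature map has fewer than k values, all of them are returned.
    """
    # Indices of the k largest values, then restored to positional order.
    top = sorted(range(len(feature_map)),
                 key=lambda i: feature_map[i], reverse=True)[:k]
    return [feature_map[i] for i in sorted(top)]


short_map = [0.2, 0.9, 0.1, 0.5]           # from a short sentence
long_map = [0.3, 0.1, 0.8, 0.4, 0.7, 0.2]  # from a longer sentence

print(k_max_pool(short_map, k=2))  # [0.9, 0.5]
print(k_max_pool(long_map, k=2))   # [0.8, 0.7]
print(k_max_pool(long_map, k=1))   # [0.8]
```

With k=1 this reduces to the usual max-over-time pooling, one value per filter map.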
Let's say we have 3 buckets of different lengths. So do we train 3 different nets?
Can't we use a dynamic RNN instead, one that adds units according to the length of the input sequence in the encoder? The encoder would then pass its last hidden state to the decoder. Will that work?
I went through this. Bucketing helps to speed up the training process. We first divide the examples into buckets, which reduces the amount of padding we need.
In the training iterations we select one bucket at a time and train the whole network.
During validation we check the perplexity of the test examples in each bucket.
TensorFlow supports this bucketing.
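A small sketch of the bucketing idea in plain Python. The function name assign_buckets, the bucket sizes, and the pad value 0 are all assumptions made for illustration, not TensorFlow's own bucketing code.

```python
def assign_buckets(sequences, bucket_sizes, pad_value=0):
    """Put each sequence in the smallest bucket that fits it, then pad.

    Sequences longer than the largest bucket are skipped in this sketch.
    """
    sizes = sorted(bucket_sizes)
    buckets = {size: [] for size in sizes}
    for seq in sequences:
        for size in sizes:
            if len(seq) <= size:
                padding = [pad_value] * (size - len(seq))
                buckets[size].append(list(seq) + padding)
                break
    return buckets


data = [[1, 2], [3, 4, 5, 6], [7], [8, 9, 10, 11, 12, 13, 14]]
buckets = assign_buckets(data, bucket_sizes=[3, 5, 8])
for size, rows in buckets.items():
    print(size, rows)
# 3 [[1, 2, 0], [7, 0, 0]]
# 5 [[3, 4, 5, 6, 0]]
# 8 [[8, 9, 10, 11, 12, 13, 14, 0]]
```

Here the buckets need 5 padding symbols in total, whereas padding everything to the longest sequence (length 7) would need 14. The same network is trained on all buckets; only the padded length changes per batch.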
Dynamic RNN is different: there is no bucketing mechanism. We input the data as a tensor of shape [batch_size, max_seq_length, input_size] (the default batch-major layout).
Here, sequences shorter than the maximum length should be padded with zeros.
It then creates dynamic RNNs whose lengths equal the actual input lengths (without the padded zeros). This uses a while loop in TensorFlow.
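Roughly, the effect of passing the actual length can be sketched like this. It is a toy stand-in, not TensorFlow code: run_dynamic and toy_step are hypothetical names, the "cell" is just a running sum, and the behaviour past the length (zero outputs, state carried through unchanged) is my understanding of how tf.nn.dynamic_rnn treats sequence_length.

```python
def run_dynamic(inputs, length, step, init_state=0):
    """Run `step` over the first `length` inputs only.

    Past `length`, the output is zero and the state is left untouched,
    so the final state is the one from the last real time step.
    """
    state = init_state
    outputs = []
    for t, x in enumerate(inputs):
        if t < length:
            state = step(state, x)
            outputs.append(state)
        else:
            outputs.append(0)  # padded step: no update, zero output
    return outputs, state


def toy_step(state, x):
    """Stand-in for an RNN cell: the new state is a running sum."""
    return state + x


outputs, final_state = run_dynamic([2, 3, 4, 0, 0], length=3, step=toy_step)
print(outputs)      # [2, 5, 9, 0, 0]
print(final_state)  # 9
```

The useful consequence is that the final state is not polluted by the padded zeros, which matters when you pass it on to a decoder or a classifier.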
I want to train a bi-directional LSTM in TensorFlow for a sequence classification problem (sentiment classification).
Because the sequences are of variable lengths, batches are normally padded with zero vectors. Normally, I use the sequence_length parameter in the uni-directional RNN to avoid training on the padding vectors.
How can this be managed with a bi-directional LSTM? Does the "sequence_length" parameter automatically make the backward direction start from the correct position in the sequence, skipping the padding?
Thank you
bidirectional_dynamic_rnn also has a sequence_length parameter that takes care of sequences of variable lengths.
From https://www.tensorflow.org/api_docs/python/tf/nn/bidirectional_dynamic_rnn:
sequence_length: An int32/int64 vector, size [batch_size], containing the actual lengths for each of the sequences.
You can see an example here: https://github.com/Franck-Dernoncourt/NeuroNER/blob/master/src/entity_lstm.py
In the forward pass, the RNN cell stops at sequence_length, which is the unpadded length of the input and is a parameter of tf.nn.bidirectional_dynamic_rnn. In the backward pass, it first uses tf.reverse_sequence to reverse the first sequence_length elements and then traverses them in the same way as in the forward pass.
https://tensorflow.google.cn/api_docs/python/tf/reverse_sequence
This op first slices input along the dimension batch_axis, and for each slice i, reverses the first seq_lengths[i] elements along the dimension seq_axis.
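A plain-Python sketch of what that description means for a batch stored as a list of rows. It only mimics the documented behaviour for the simple case of batch axis 0 and sequence axis 1; it is not the TensorFlow op itself.

```python
def reverse_sequence(batch, seq_lengths):
    """Reverse the first seq_lengths[i] elements of row i; keep the rest."""
    return [row[:n][::-1] + row[n:] for row, n in zip(batch, seq_lengths)]


batch = [
    [1, 2, 3, 0, 0],  # real length 3, then padding
    [4, 5, 6, 7, 8],  # real length 5, no padding
    [9, 0, 0, 0, 0],  # real length 1
]
lengths = [3, 5, 1]
reversed_batch = reverse_sequence(batch, lengths)
for row in reversed_batch:
    print(row)
# [3, 2, 1, 0, 0]
# [8, 7, 6, 5, 4]
# [9, 0, 0, 0, 0]
```

The padding stays at the end of each row, so the backward RNN starts on the last real element rather than on a padded one.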
I want to create an LSTM in tensorflow to predict time-series data. My training data is a set of input/output sequences of different lengths. Can I include multiple sequences of different lengths in the same training batch? Or do I need to pad them to equal lengths? If so, how?
Also: What will tensorflow do if the unrolled RNN is longer than the input sequence? The rnn() method contains an optional sequence_length argument which appears designed to handle this eventuality, but I'm not clear what it does.
Do you want to build the model from scratch? Otherwise you might want to look into the translate.py-model. Here your issue is taken care of by:
- padding the input (and output) sequences with a PAD-symbol (basically a neutral "no info"-symbol)
- buckets: for different groups of lengths you can create different buckets (this makes sense only if your sequence lengths vary widely from shortest to longest)
You don't have to group input/output sequences of the same length into a batch. TF has a way to specify the input length. The parameter "sequence_length" controls the number of time steps a cell is unrolled, so TF will unroll your cell only up to sequence_length for each example, rather than for the full number of steps.
So while feeding the inputs and outputs, also feed a sequence_length array that contains the length of each input:
tf.nn.bidirectional_rnn(fwd_stacked_lstm_cells, bwd_stacked_lstm_cells,
                        reshaped_inputs,
                        sequence_length=sequence_length)
.....
feed_dict={
    model.inputs: x,
    model.targets: y,
    model.sequence_length: lengths})
where
len(lengths) == batch_size and
for all i, lengths[i] == length of input x[i] (same as the length of output y[i])