When training a seq2seq model with the bucketing method, do we keep separate RNNs for each bucket? - tensorflow

Let's say we have 3 buckets of different lengths. Do we then train 3 different networks?
Couldn't we instead keep a dynamic RNN that unrolls according to the length of the input sequence in the encoder? The encoder would then pass its last hidden state to the decoder. Would that work?

I went through this. Bucketing helps speed up the training process: we first divide the examples into buckets by length, which reduces the amount of padding we have to add.
In each training iteration we select one bucket at a time and train the whole network on it.
During validation we check the perplexity of the test examples in each bucket.
TensorFlow supports this bucketing.
A dynamic RNN is different: there is no bucketing mechanism. We instead feed the data as a tensor of shape [batch_size, max_seq_length, input_size], where sequences shorter than the maximum length are padded with zeros.
The RNN is then unrolled dynamically only up to the actual length of each input (ignoring the padded zeros). This is implemented with a while loop inside TensorFlow.
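
As a minimal sketch of the dynamic-RNN approach described above (assuming the TF 1.x API and made-up sizes), the padded positions beyond each sequence's true length are skipped by the internal while loop:

    import tensorflow as tf

    batch_size, max_seq_length, input_size, hidden_size = 32, 50, 100, 128

    # Inputs padded with zeros up to max_seq_length: [batch_size, max_seq_length, input_size]
    inputs = tf.placeholder(tf.float32, [batch_size, max_seq_length, input_size])
    # Actual (unpadded) length of each sequence in the batch
    seq_lengths = tf.placeholder(tf.int32, [batch_size])

    cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
    # dynamic_rnn unrolls with an internal while loop and stops at each sequence's true length
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs,
                                             sequence_length=seq_lengths,
                                             dtype=tf.float32)

The final_state here is what a seq2seq encoder would hand to the decoder; with bucketing, by contrast, you would build (or reuse) the graph per bucket length instead of relying on sequence_length.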

Related

Why does Keras accept a batch size option for model.evaluate?

Why does the evaluate function of the Keras API in Tensorflow accept a batch_size? To my knowledge, this parameter should only be relevant for managing how many samples we use per iteration during training. What influence does this choice have during model evaluation?
Batch size mainly matters for sequence-based or time-series predictions.
Below are the cases where you have to provide a batch size at prediction time.
In time-series use cases it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions, in order to predict the next step in the sequence.
For a stateful RNN you must provide a fixed batch size during prediction/evaluation, because the output state of the current batch is used as the initial state for the next batch; these models carry information from one batch to the next.
If your model doesn't fall into these categories, you technically don't need to provide a batch size when evaluating. Even if you do, it only controls how much data is fed to the GPU at a time.
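
A rough sketch of the stateful case (hypothetical layer sizes): the batch size is pinned in the model, so the same value has to be used when fitting and evaluating, because the LSTM state is carried from one batch to the next:

    from tensorflow import keras

    timesteps, features = 20, 8
    batch_size = 1  # fixed; must match at fit/evaluate/predict time

    model = keras.Sequential([
        # batch_input_shape pins the batch size so state can be carried across batches
        keras.layers.LSTM(32, stateful=True,
                          batch_input_shape=(batch_size, timesteps, features)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # evaluate must then use the same fixed batch size, e.g.:
    # loss = model.evaluate(x_test, y_test, batch_size=batch_size)

For a plain stateless model, batch_size in evaluate only changes how many samples are pushed through the hardware per step, not the result.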

Keras : Shuffling dataset while using LSTM

Correct me if I am wrong, but according to the official Keras documentation the fit function has the argument shuffle=True by default, hence it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTM or GRU is to exploit the precise order of the data, so that the state induced by previous samples influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTMs where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model performs worse even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.
If we shuffle all the data, all the logical sequences are broken.
No, the shuffling happens on the batches axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not only one sequence to learn from, but many sequences. Only the order in which these many sequences are being trained on gets shuffled. The sequences themselves stay intact.
Shuffling is almost always a good idea, because your network should learn the training examples themselves, not the order in which they appear.
This being said, there are cases where you have indeed only one huge sequence to learn from. In that case you have the option to still divide your sequence into several batches. If this is the case, you are absolutely right with your concern that shuffling would have a huge negative impact, so don't do that in this case!
Note: RNNs have a stateful parameter that you can set to True. In that case the last state of the previous batch is passed to the following one, which effectively makes your RNN see all batches as one huge sequence. So absolutely do this if you have one huge sequence spread over multiple batches.
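
A small sketch of the two situations, with made-up shapes: shuffle=True only reorders whole sequences along the batch axis, while a single long sequence split across batches should instead use stateful=True and shuffle=False:

    from tensorflow import keras
    import numpy as np

    # Many independent sequences: shape (batch, timesteps, features)
    x = np.random.rand(1000, 30, 4)
    y = np.random.rand(1000, 1)

    model = keras.Sequential([
        keras.layers.LSTM(16, input_shape=(30, 4)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    # shuffle=True (the default) only permutes the 1000 sequences;
    # each 30-step sequence stays intact along the time axis.
    model.fit(x, y, epochs=1, shuffle=True)

    # One huge sequence split into batches: keep the order and carry state across batches
    # model = keras.Sequential([
    #     keras.layers.LSTM(16, stateful=True, batch_input_shape=(32, 30, 4)),
    #     keras.layers.Dense(1),
    # ])
    # model.fit(x, y, epochs=1, shuffle=False)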

Large trainable embedding layer slows down training

I am training a network to classify text with a LSTM. I use a randomly initialized and trainable embedding layer for the word inputs. The network is trained with the Adam Optimizer and the words are fed into the network with a one-hot-encoding.
I noticed that the number of words represented in the embedding layer heavily influences the training time, but I don't understand why. Increasing the number of words in the network from 200'000 to 2'000'000 almost doubled the time for a training epoch.
Shouldn't training only update the weights that were used during the prediction of the current data point? Thus, if my input sequences always have the same length, the same number of updates should happen regardless of the size of the embedding layer.
The number of updates needed would be reflected in the number of epochs it takes to reach a certain precision.
If your observation is that convergence takes the same number of epochs, but each epoch takes twice as much wall clock time, then it's an indication that simply performing the embedding lookup (and writing the update of embedding table) now takes a significant part of your training time.
Which could easily be the case. 2'000'000 words times 4 bytes per float32 times the length of your embedding vector (what is it? let's assume 200) is something like 1.6 gigabytes of data that needs to be touched every minibatch. You also don't say how you're training this (CPU, GPU, which GPU), and that has a meaningful impact because of e.g. cache effects: on a CPU, doing the exact same number of reads/writes in a slightly less cache-friendly manner (more sparsity) can easily double the execution time.
Also, your premise is a bit unusual. How much labeled data do you have with enough examples of the 2,000,000th rarest word to calculate a meaningful embedding for it directly? It's probably possible, but it would be unusual: in pretty much all datasets, including very large ones, the 2,000,000th word would be a nonce word, and it would be harmful to include it in the trainable embeddings. The usual scenario is to compute large embeddings separately from large unlabeled data and use them as a fixed, untrainable layer, possibly concatenated with small trainable embeddings learned from the labeled data to capture things like domain-specific terminology.
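
A sketch of that usual scenario, assuming you already have a pretrained embedding_matrix of shape (vocab_size, embedding_dim) computed from unlabeled data (the matrix name and sizes here are placeholders): the large table is frozen so it is never rewritten during training.

    from tensorflow import keras
    import numpy as np

    vocab_size, embedding_dim = 2_000_000, 200
    # Placeholder for a pretrained matrix (e.g. word2vec/GloVe) loaded from disk
    embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype="float32")

    frozen_embedding = keras.layers.Embedding(
        input_dim=vocab_size,
        output_dim=embedding_dim,
        weights=[embedding_matrix],
        trainable=False,  # no gradient updates, so the ~1.6 GB table is not updated every step
    )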
If I understand correctly, your network maps one-hot vectors representing words to embeddings of some size embedding_size. The embeddings are then fed as input to an LSTM. The trainable variables of the network are those of both the embedding layer and the LSTM itself.
You are correct regarding the update of the weights in the embedding layer. However, the number of weights in one LSTM cell also depends on the size of the embedding. If you look, for example, at the equation for the forget gate at time step t,
f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f),
you can see that the weight matrix W_f is multiplied by the input x_t, meaning that one of the dimensions of W_f must be exactly embedding_size. So as embedding_size grows, so does the network size, and training takes longer.
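
To make both effects concrete, here is a rough sketch (arbitrary sizes chosen for illustration) comparing parameter counts for two embedding sizes: the embedding table grows with vocab_size * embedding_size, and the LSTM's own matrices grow with embedding_size as well, since each gate's weight matrix has shape (embedding_size + units, units).

    from tensorflow import keras

    def count_params(vocab_size, embedding_size, units=128):
        model = keras.Sequential([
            keras.layers.Embedding(vocab_size, embedding_size),
            keras.layers.LSTM(units),
        ])
        model.build(input_shape=(None, 50))  # 50 = arbitrary sequence length
        return model.count_params()

    # LSTM parameters alone: 4 * (embedding_size + units + 1) * units
    print(count_params(vocab_size=50_000, embedding_size=100))
    print(count_params(vocab_size=50_000, embedding_size=300))  # larger LSTM matrices too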

LSTM how batch size and sequence length affect memory

I have a question regarding batch size and sequence length. Let’s suppose that I have 10 different independent time series, each of length 100.
5 are of a person doing one activity, and the other 5 are of a person doing another activity.
I want to create an LSTM that will be able to remember the sequences all the way from the first sample in each sequence and classify test samples that I input into one activity or the other.
Now, for a first try, let's say that I can input test samples of length 100. How would I do this? Would I create an LSTM and feed in data of the shape [10, 100, 1] in one go? Or would I feed in data of the shape [1, 100, 1] ten times? The question here is: does batching affect how the LSTM memorizes past inputs? I do not want the LSTM to remember across independent sequences, but I do want it to remember all the way back to the beginning of each time sequence.
Secondly, let's say that I now want to chunk up the sequences I use to train the LSTM. The goal remains the same as before. So now I window the sequences into chunks of 10. Do I feed them in as [10, 10, 1] for each sequence? If I do this, will the LSTM memorize the temporal dynamics of the sequence all the way back to the beginning? Will training the LSTM this way be analogous to not chunking up the sequences and feeding them in at full length?
I can answer the part of your question that has to do with batching. There are two reasons to batch.
It is more efficient for the computer to do the matrix multiplications in batches. If you are doing it on a CPU, part of the efficiency comes from being able to cache the weight matrix instead of reloading it from memory. During evaluation, the sequences in the batch do not interfere with each other; it is the same as if each one were computed individually.
During training, having multiple sequences in a batch reduces noise in the gradient. The weight update is computed by averaging the gradients of all the sequences in the batch. Having more sequences gives a more reliable estimate of which direction to move the parameters in order to improve the loss function.
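
Here is a small sketch (made-up layer sizes) of the "sequences in a batch do not interfere" point: feeding the 10 sequences as one batch of shape [10, 100, 1] gives the same per-sequence outputs as feeding them one at a time as [1, 100, 1], because a stateless LSTM keeps a separate state per batch entry and resets it between calls.

    from tensorflow import keras
    import numpy as np

    model = keras.Sequential([
        keras.layers.LSTM(8, input_shape=(100, 1)),
        keras.layers.Dense(2, activation="softmax"),  # two activity classes
    ])

    x = np.random.rand(10, 100, 1).astype("float32")

    batched = model.predict(x)  # all 10 sequences in one batch
    one_by_one = np.vstack([model.predict(x[i:i + 1]) for i in range(10)])

    print(np.allclose(batched, one_by_one, atol=1e-5))  # True: sequences don't interact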

Dynamic LSTM model in Tensorflow

I am looking to design an LSTM model using TensorFlow in which the sentences are of different lengths. I came across a tutorial on the PTB dataset (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py). How does this model handle instances of varying length? The example does not discuss padding or any other technique for handling variable-size sequences.
If I use padding, what should be the unrolling dimension?
You can do this in two ways.
1. TF has a way to specify the actual input length. Look for a parameter called "sequence_length"; I have used it with tf.nn.bidirectional_rnn. TF will then unroll your cell only up to sequence_length, not up to the full step size.
2. Pad your input with a predefined dummy input and a predefined dummy output. The LSTM cell will learn to predict the dummy output for the dummy input. When using the results (say for matrix calculations), chop off the dummy parts.
The PTB model is truncated in time -- it always back-propagates a fixed number of steps (num_steps in the configs). So there is no padding -- it just reads the data and tries to predict the next word, always reading num_steps words at a time.
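
As an illustration of that truncated approach (a simplified sketch of what the PTB reader does, with made-up numbers), the whole token stream is reshaped into batch_size parallel rows and then read num_steps columns at a time, so every training step sees fixed-length inputs and no padding is needed:

    import numpy as np

    batch_size, num_steps = 4, 5
    data = np.arange(1, 101)  # pretend these are 100 word ids

    # Reshape the stream into batch_size parallel rows
    rows = len(data) // batch_size
    data = data[:rows * batch_size].reshape(batch_size, rows)

    # Each step yields inputs of shape [batch_size, num_steps] and targets shifted by one word
    epoch_size = (rows - 1) // num_steps
    for i in range(epoch_size):
        x = data[:, i * num_steps:(i + 1) * num_steps]
        y = data[:, i * num_steps + 1:(i + 1) * num_steps + 1]
        print(x.shape, y.shape)  # always (batch_size, num_steps); no padding needed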