variable size of input for CNN model in text classification? - tensorflow

I implemented the CNN model for text classification based on this paper. Since a CNN can only deal with sentences of a fixed size, I set the input size to the maximum sentence length in my dataset and zero-padded the shorter sentences. But to my understanding, no matter how long the input sentence is, the max-pooling strategy will always extract only one value per filter map. So the length of the input sentence shouldn't matter, because after convolution and pooling the output has the same size either way. In that case, why should I zero-pad all the short sentences to a fixed size?
For example, my code for feeding data into the CNN model is self.input_data = tf.placeholder(tf.int32,[None,max_len],name="input_data"). Can I leave max_len unspecified and use None instead, so that the shape follows the length of the current training sentence?
In addition, I was wondering whether there is any newer approach that can handle variable-length input for a CNN model. I also found another paper that addresses this problem, but as far as I understand it keeps the top k values in max-pooling instead of a single value. How does that let it deal with variable-length sentences?

Quick answer:
No, you can't.
Longer answer:
Pooling is like a reduce function: applying it to a layer reduces its dimensions. But different input shapes do not produce the same output shapes, so with zero padding you can simulate a fixed shape, and that is exactly what max_len is doing. The second paper's idea is different: it builds a dynamic computational graph, essentially creating several networks of different depths depending on the input size. The generalized version of this for encoder-decoder architectures is called ByteNet.
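For context, here is a rough sketch of the fixed-size setup being discussed (TF 1.x-style API; max_len, vocab_size, embed_dim and num_filters are example values, not from the question): every sentence is zero-padded to max_len, convolved, and max-pooled over time so that each filter map contributes one value.

import tensorflow as tf

max_len, vocab_size, embed_dim, num_filters = 50, 10000, 128, 100  # example sizes

input_data = tf.placeholder(tf.int32, [None, max_len], name="input_data")
embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
embedded = tf.nn.embedding_lookup(embedding, input_data)        # [batch, max_len, embed_dim]

conv = tf.layers.conv1d(embedded, filters=num_filters, kernel_size=3,
                        activation=tf.nn.relu)                  # [batch, max_len - 2, num_filters]
pooled = tf.reduce_max(conv, axis=1)                            # one value per filter map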

Related

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path; I want all the most probable paths, so that I can visualize them in a grid map.
For this I have training data consisting of 40,000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths, I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll have loss of generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
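To make the first half of that concrete, here is a minimal Keras sketch of a recurrent model mapping a (10, 2) input sequence to a (6, 2) output sequence (the layer sizes are arbitrary assumptions); the variational part would then add a sampled latent vector between encoder and decoder.

from keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from keras.models import Model

inp = Input(shape=(10, 2))                  # 10 observed timesteps, 2D coordinates
encoded = LSTM(64)(inp)                     # summarize the observed trajectory
repeated = RepeatVector(6)(encoded)         # one copy per predicted timestep
decoded = LSTM(64, return_sequences=True)(repeated)
out = TimeDistributed(Dense(2))(decoded)    # a 2D coordinate per output step

model = Model(inp, out)
model.compile(optimizer='adam', loss='mse')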
Unfortunately, VAEs can't really be explained within a single paragraph on Stack Overflow, and even if they could I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic network libraries for Python, better suited to the task at hand than Keras.

What are the effects of padding a tensor?

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality (timesteps, features), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your per-sample input shape will be (100,10,). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
from keras.layers import Input, LSTM

myLongInput = Input(shape=(8192,8,))        # 8192 timesteps, 8 values per step
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction = LSTM(4, return_sequences=True)   # keep the output of every timestep
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
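For instance, a hypothetical continuation of the snippet above (the names here are made up) could stack a second LSTM on the sequence output and finish with a 4-way softmax for the one-hot labels:

from keras.layers import Dense

mySecondRecurrentFunction = LSTM(4)                            # no return_sequences: keep only the last step
myFinalState = mySecondRecurrentFunction(myLongOutput)
myPredictions = Dense(4, activation='softmax')(myFinalState)   # matches the one-hot Y of shape (4,)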
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

Keras : variable input sequence: padding vs shape None?

I've found two possible solutions for handling variable-size input sequences for an RNN in Keras.
Solution one:
input = Input(shape=(None, num_classes))
Then I can put any sequence length as input for both training and validation.
Solution two:
input = Input(shape=(max_seq_length, num_classes))
...
pad_sequences(input_data, maxlen=max_seq_length, padding='post')
Which solution is recommended?
I'm weighing the benefits of the two. What I can see in solution two is a kind of validation of the input size: the input cannot be longer than max_seq_length, and I can also choose the type of padding (pre/post) and likewise how overly long sequences are truncated.
What kind of padding and trimming is done with solution one? The default parameters of pad_sequences?
I've benchmarked the training time of the model for both solutions and it's roughly the same. I guess that under the hood they are the same, i.e. max_seq_length is calculated from the maximum length of the training sequences. Am I right?
Thank you for any clarification!
There is simply no padding or trimming in solution one. It takes the sequence as is and processes it. The model is totally independent of sequence length.
In solution two, the best thing to do is add a Masking layer. It will simply skip processing the padded values.
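A small sketch of solution two with masking, assuming the sequences are zero-padded (the sizes below are example values):

from keras.layers import Input, Masking, LSTM

max_seq_length, num_classes = 100, 10       # example values
inputs = Input(shape=(max_seq_length, num_classes))
masked = Masking(mask_value=0.0)(inputs)    # all-zero (padded) timesteps are skipped downstream
outputs = LSTM(64)(masked)                  # 64 units is an arbitrary choice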

What does size of the GRU or LSTM cell in the TensorFlow seq2seq tutorial represent?

I'm working with the seq2seq model in the TensorFlow tutorials, and I'm having trouble understanding some of the details. One thing that is confusing to me is what the "size" of a cell represents. I think I have a high level understanding of images like
I believe this is showing that the output from the last step in the encoder is the input to the first step in the decoder. In this case each box is the GRU or LSTM cell at a different time-step in the sequence.
I also think I understand, at a superficial level, diagrams like this:
from colah's blog post about LSTM and GRU cells. My understanding is that a "cell" is a neural network that feeds the output from one step back into itself along with the new input for the subsequent step. The gates control how much it "remembers" and "forgets."
I think I am getting confused at the level between this superficial, high level understanding and the low-level details. It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct? If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size? Is that what the embedding wrapper does?
Most importantly, what effect would increasing or decreasing the size of the cell have? Would a larger cell be better at learning embeddings? Or at predicting outputs? Both?
It sounds like the "size" of a cell is the number of nodes in the sigmoid and tanh boxes. Is that correct?
The size of the cell is the size of the RNN state vector h. In the case of LSTM it's also the size of c. It's not "the number of nodes" (I'm not sure what you mean by nodes).
If so, how does that relate to the input size for the seq2seq model? For example, the default vocabulary size is 40,000, and the default cell size is 1024. How does the 40,000 element one-hot vocabulary vector for each step of the sequence get matched to the 1024 node internal cell size?
The input size for the model is independent of the state size. The two vectors (input and state) are concatenated and multiplied by a matrix of shape [state_size + input_size, state_size] to get the next state (simplified version).
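As a toy illustration of that update (a plain-RNN simplification with made-up values, not the actual GRU/LSTM equations):

import numpy as np

state_size, input_size = 1024, 1024
W = np.random.randn(state_size + input_size, state_size)   # the [state_size + input_size, state_size] matrix

x_t = np.random.randn(input_size)    # embedded input at step t
h_prev = np.zeros(state_size)        # previous state

h_next = np.tanh(np.concatenate([h_prev, x_t]) @ W)        # next state (bias and gating omitted)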
Is that what the embedding wrapper does?
No, the embedding is the result of multiplying the one-hot input vector with a matrix of size [vocab_size, input_size]; this happens before the state-update multiplication described above.
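Concretely, the lookup is equivalent to this toy example (made-up token id):

import numpy as np

vocab_size, input_size = 40000, 1024
E = np.random.randn(vocab_size, input_size)   # embedding matrix of size [vocab_size, input_size]

token_id = 123
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

embedded = one_hot @ E                        # identical to the row lookup E[token_id]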

Dynamic LSTM model in Tensorflow

I am looking to design an LSTM model using TensorFlow in which the sentences are of different lengths. I came across a tutorial on the PTB dataset (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py). How does this model capture instances of varying length? The example does not discuss anything about padding or other techniques to handle variable-size sequences.
If I use padding, what should be the unrolling dimension?
You can do this in two ways.
1. TF has a way to specify the input size. Look for a parameter called "sequence_length"; I have used it with tf.nn.bidirectional_rnn. TF will then unroll the cell only up to sequence_length rather than to the full step size (see the sketch after this list).
2. Pad your input with a predefined dummy input and a predefined dummy output for the dummy steps. The LSTM cell will learn to predict the dummy output for the dummy input. When using the result (say for a matrix calculation), chop off the dummy parts.
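Here is a rough sketch of the first option using the TF 1.x dynamic_rnn API (the shapes and sizes are assumptions; tf.nn.bidirectional_rnn takes an analogous sequence_length argument):

import tensorflow as tf

# inputs: [batch_size, max_time, feature_dim], zero-padded; lengths: the true length of each example
inputs = tf.placeholder(tf.float32, [None, 50, 128])
lengths = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.LSTMCell(num_units=256)
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs, sequence_length=lengths, dtype=tf.float32)   # stops unrolling at each true length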
The PTB model is truncated in time -- it always back-propagates a fixed number of steps (num_steps in the configs). So there is no padding -- it just reads the data and tries to predict the next word, and always reads num_steps words at a time.
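A rough sketch of that truncated batching scheme (a simplified, hypothetical re-implementation, not the tutorial's actual reader code):

import numpy as np

def ptb_style_batches(word_ids, batch_size, num_steps):
    # Yield (inputs, targets) windows of num_steps words; targets are the inputs shifted by one.
    data = np.array(word_ids)
    batch_len = (len(data) - 1) // batch_size
    x = data[:batch_size * batch_len].reshape(batch_size, batch_len)
    y = data[1:batch_size * batch_len + 1].reshape(batch_size, batch_len)
    for i in range(batch_len // num_steps):
        yield (x[:, i * num_steps:(i + 1) * num_steps],
               y[:, i * num_steps:(i + 1) * num_steps])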