Dynamic RNN: padding word vector - tensorflow

I am a bit confused about padding. My first question is:
Is it possible to pad a shorter sequence with values other than 0? How would the RNN then handle that?
Generally 0 is used for padding; is there a specific reason why? Does it simplify training because it does not affect the calculation, or do you still need to mask the loss function?
If your sentence is composed of embedding vectors from a word2vec model, would the padding be applied as a zero vector?
Thanks in advance for any hint!

Your question is addressed in How to overcome training example's different lengths when working with Word Embeddings (word2vec).
For details on the alternating min/max padding method, see Apply word embeddings to entire document, to get a feature vector.
See also keras.preprocessing.sequence.pad_sequences, which accepts a value argument specifying what to pad with.
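For illustration, a minimal sketch of padding with a non-zero value via pad_sequences (the sequences and the padding value -1 are arbitrary placeholders):
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[3, 7, 2], [5, 1], [4]]

# pad every sequence at the end, to the length of the longest one, using -1 instead of 0
padded = pad_sequences(sequences, padding='post', value=-1)
print(padded)
# [[ 3  7  2]
#  [ 5  1 -1]
#  [ 4 -1 -1]]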

Related

How to customize a Deep Learning model if the output is one-hot vectors? [closed]

I am trying to build a deep learning model with TensorFlow and Keras. It is a sequential model for Single-Instance Multi-Label tasks, a simplified version of Multi-Instance Multi-Label.
Concretely, the input of my model is an array of fixed length, so it can be represented as a vector.
The output of my model is a sequence of letters drawn from an alphabet of fixed size. For example, an alphabet {A,B,C,D} with only 4 possible members. So I can use a one-hot vector to represent each letter in a sequence.
The length of the sequences is variable, but for simplicity I use a fixed length (equal to that of the longest sequence) to store all sequences.
If a sequence is shorter than the fixed length, it is represented by one-hot vectors (as many as the sequence's actual length) followed by zero vectors (filling the remaining length). For example, CADB is represented by a 4 * 5 matrix whose columns are, in order, the one-hot encodings of C, A, D and B, followed by an all-zero column.
Please note: the first 4 columns of this matrix are one-hot vectors, each of which has one and only one 1 entry, and all other entries are 0s.
But the entries of the last column are all 0s, which can be seen as zero padding because the sequence of letters is not long enough.
So, in short, the input is a vector and the output is a matrix.
Different from the link posted above, the output matrix should be seen as a whole. So one input vector is assigned to a whole matrix, not to a row or column of this matrix.
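For concreteness, a small sketch of how such a padded one-hot matrix could be built (assuming the rows are ordered A, B, C, D; the helper name encode is just for illustration):
import numpy as np

alphabet = ['A', 'B', 'C', 'D']
max_len = 5

def encode(sequence):
    # one column per position, one row per letter; padded positions stay all-zero
    matrix = np.zeros((len(alphabet), max_len), dtype=np.float32)
    for pos, letter in enumerate(sequence):
        matrix[alphabet.index(letter), pos] = 1.0
    return matrix

print(encode("CADB"))  # 4 x 5 matrix, last column all zeros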
My question is: how do I customize my deep learning model for this special output? For example:
What loss function and accuracy metric should I choose or design?
Do I need to customize a special layer at the very end of my model?
You should use softmax activation on the output layer and have categorical_crossentropy as the loss function.
However, as you can see in the links above, the problem is that these two functions are by default applied on the last axis (axis=-1), while in your situation it is the second-to-last axis (the columns of the matrix) that is one-hot encoded.
To use the right axis, one option is to define your own versions of these functions like so:
import tensorflow as tf

def softmax_columns(x):
    # normalize down each column (axis=-2) instead of across the default last axis
    return tf.keras.backend.softmax(x, axis=-2)

def categorical_crossentropy_columns(target, output):
    # compute the cross-entropy along the one-hot axis, i.e. the columns (axis=-2)
    return tf.keras.backend.categorical_crossentropy(target, output, axis=-2)
Then, you can use these like so:
model.add(SomeLayer(..., activation=softmax_columns, ...)) # output layer
model.compile(loss=categorical_crossentropy_columns, ...)
One good alternative (in general, not only here) is to use from_logits=True in the categorical_crossentropy call. This effectively builds the softmax into the loss function, so that your model itself does not need (in fact: must not have) the final softmax activation anymore. This not only saves work, but is also more numerically stable.
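A minimal sketch of that alternative, assuming the same column-wise setup as above (SomeLayer and the ... arguments are placeholders, as in the snippet before):
def categorical_crossentropy_columns_logits(target, output):
    # the softmax is folded into the loss, so the model's last layer must output raw logits
    return tf.keras.backend.categorical_crossentropy(target, output, from_logits=True, axis=-2)

model.add(SomeLayer(..., activation=None, ...))  # output layer: no softmax here
model.compile(loss=categorical_crossentropy_columns_logits, ...)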

how to reduce the dimension of the document embedding?

Let us assume that I have a set of document embeddings (D).
Each document embedding consists of N word vectors, where each of these pre-trained vectors has 300 dimensions.
The corpus would be represented as [D,N,300].
My question is: what would be the best way to reduce [D,N,300] to [D,1,300]? How should I represent each document with a single vector instead of N vectors?
Thank you in advance.
I would say that what you are looking for is doc2vec. Using this you can convert a whole document into a single 300-dimensional vector. You can use it like this:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# your_documents is your own corpus: a list of token lists, one per document
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(your_documents)]
model = Doc2Vec(documents, vector_size=300, window=2, min_count=1, workers=4)
This will train the model on your data and you will be able to represent each document with only one vector as you specified in the question.
You can run inference with:
vector = model.infer_vector(doc_words)
I hope this is helpful :)
It's fairly common and fairly (perhaps surprisingly) effective to simply average the word vectors.
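A minimal sketch of that averaging, assuming the corpus is already a [D, N, 300] NumPy array (the shapes and variable names below are placeholders):
import numpy as np

corpus = np.random.rand(100, 50, 300)        # stand-in for your real [D, N, 300] embeddings
doc_vectors = corpus.mean(axis=1)            # average over the word axis -> shape [D, 300]
doc_vectors = doc_vectors[:, np.newaxis, :]  # reshape to [D, 1, 300] if you need that exact layout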
Good question, but all of the answers will result in some loss of information. The best way for you is to use a Bi-LSTM/GRU layer, provide your word embeddings as input to that layer, and take the output of the last time step.
The output of the last time step will carry the contextual information of the document in both the forward and backward directions. Hence, this is the best way to get what you want, as the model learns the representation.
Note that the larger the document, the greater the loss of information.
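A minimal Keras sketch of that idea, assuming pre-computed word embeddings are fed in directly (N, the embedding size, and the LSTM width are placeholder values):
import tensorflow as tf

N, EMB_DIM = 50, 300  # words per document and embedding size (placeholders)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(N, EMB_DIM)),
    # return_sequences=False keeps only the last time step of each direction
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150, return_sequences=False)),
])
# calling the model on a [D, N, 300] array yields [D, 300] (150 forward + 150 backward units)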

Tensorflow: Using one tensor to index slices of another [duplicate]

This question already has answers here:
Get the last output of a dynamic_rnn in TensorFlow
(4 answers)
Closed 4 years ago.
As motivation for this question, I'm trying to use variable length sequences with tf.nn.dynamic_rnn. When I was training with batch_size=1 (one element at a time), everything was going swimmingly, but now I'm trying to increase the batch size, which means zero-padding sequences to the same length.
I've zero-padded (or truncated) all of my sequences up to the max length of 15000.
outputs (from the RNN) has shape [batch_size, max_seq_length, num_units], which for concreteness is right now [16, 15000, 64].
I also create a seq_lengths tensor, which is [batch_size], so [16], corresponding to the actual sequence length of all the zero-padded sequences.
I've added a fully connected layer, to multiply what was previously outputs[:,-1,:] by W, then add a bias term, since ultimately I'm just trying to predict a single value (or rather batch_size values). However, now, I can't just naively use -1 as the index, because all of the sequences have been variously padded! I have seq_lengths, but I'm not sure exactly how to use it to index outputs. I've searched around, and I think the answer is some clever use of tf.gather_nd, but I can't quite figure it out. I can easily see how to take individual values, but I want to preserve entire slices. Do I need to create some sort of enormous 3D mask?
Here's what I want in terms of a Python comprehension (outputs is an np.array): outputs = np.array([outputs[i, seq_lengths[i], :] for i in range(batch_size)]).
I'd appreciate any help! Thank you.
Actually, Alex, it turns out you've already answered my question for me :).
After some more research, I came across the following, which is exactly my use case: https://stackoverflow.com/a/43298689/5526865 . I won't copy the code here, but just check that out.
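For reference, a minimal sketch of that approach (not a verbatim copy of the linked answer), assuming outputs has shape [batch_size, max_seq_length, num_units] and seq_lengths is an int32 tensor holding the true lengths:
import tensorflow as tf

# one (batch_index, last_valid_timestep) pair per sequence
batch_range = tf.range(tf.shape(outputs)[0])
indices = tf.stack([batch_range, seq_lengths - 1], axis=1)  # shape [batch_size, 2]
last_outputs = tf.gather_nd(outputs, indices)               # shape [batch_size, num_units]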

Properly concatenate feature maps in Tensorflow

I am attempting to reproduce a Convolutional Neural Network from a research paper using TensorFlow.
There are many places in the diagram where the results of convolutions are concatenated. Currently I am using tf.concat (https://www.tensorflow.org/api_docs/python/tf/concat) along the last axis (representing channels) to concatenate these feature maps. I originally believed that I would want to concatenate along all axes, but this does not seem to be an option in TensorFlow. Now I am facing the problem that the paper indicates that tensors (feature maps) of different sizes should be concatenated. tf.concat does not support concatenating tensors of different sizes, so I am wondering whether this was the correct op to use in the first place. In summary, what is the correct way to concatenate feature maps (sometimes of different sizes) in TensorFlow?
Thank you.
It's impossible and meaningless to concatenate feature maps with different sizes.
If you want to concatenate 2 tensors, every dimension except the one you concatenate along must be equal.
From the image you posted, in fact, you can see that every feature map that gets concatenated has the same spatial extent (but a different depth) as the other one.
If you can't concatenate in that way, there is probably something wrong in your code, most likely the lack of padding (i.e. using padding='valid') in the convolution operations, which shrinks the spatial size.
The problem you encounter with the Inception network may be resolved by using padding in the convolutional layers to keep the sizes the same. For Inception blocks, instead of using "VALID" padding, change it to "SAME". That way you can concatenate the outputs without any resizing.
Alternatively, you can append padding to the feature maps that are going to be concatenated. You can do that by using tf.pad().
If you prefer not to do that, you can use the tf.image.resize_images function to resize them to the same size. However, this is a hacky and computationally expensive approach.
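A minimal sketch of the tf.pad option, assuming NHWC feature maps that differ only in spatial size (the shapes here are placeholders):
import tensorflow as tf

a = tf.random.normal([1, 32, 32, 64])  # larger feature map
b = tf.random.normal([1, 28, 28, 32])  # smaller feature map

# zero-pad b's height and width (2 pixels on each side) so its spatial size matches a's
b_padded = tf.pad(b, [[0, 0], [2, 2], [2, 2], [0, 0]])

# now only the channel axis differs, so concatenating along it is valid
merged = tf.concat([a, b_padded], axis=-1)  # shape [1, 32, 32, 96]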
Tensors can only be concatenated along one axis. If you need to concatenate feature maps of different sizes, you must somehow manipulate the sizes of the original tensors.

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). For each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence. Otherwise, I assign the vector for an incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
According to the Keras documentation, the padding idea seems to be the way to go. There is a masking option in the embedding layer that will make Keras skip the padded values instead of processing them. In theory, you don't lose that much performance; if the library is well built, the masking actually skips the extra processing.
You just need to take care not to assign the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will transform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly; there is no logical relation between the indices and the words.
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible for an embedding to be capable of detecting which words are verbs, nouns, feminine, masculine, etc., everything encoded in a combination of numeric values (presence/absence/intensity of features).
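A minimal sketch of such a layer in Keras, using the toy dictionary above (the output size of 8 is an arbitrary choice; mask_zero=True enables the masking mentioned earlier):
import numpy as np
import tensorflow as tf

sentences = np.array([
    [1, 2, 3, 4, 0],  # "hey, I'm here" plus one padding token
    [1, 2, 3, 5, 4],  # "hey, I'm not here"
    [1, 2, 1, 2, 1],  # "hey, hey, hey"
])

# input_dim = dictionary size + 1 for the padding index 0; mask_zero makes Keras skip padded steps
embedding = tf.keras.layers.Embedding(input_dim=6, output_dim=8, mask_zero=True)
vectors = embedding(sentences)  # shape (3, 5, 8): one 8-dimensional vector per token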
You may also try the approach in this question, which, instead of using masking, separates batches by length so that each batch can be trained on its own without any padding: Keras misinterprets training data shape