Question about input_dim in keras embedding layer - tensorflow

From the documentation on tf.keras.layers.Embedding :
input_dim:
Integer. Size of the vocabulary, i.e. maximum integer index + 1.
mask_zero:
Boolean, whether or not the input value 0 is a special “padding” value that should be masked
out. This is useful when using recurrent layers which may take variable length input. If this
is True, then all subsequent layers in the model need to support masking or an exception will
be raised. If mask_zero is set to True, as a consequence, index 0 cannot be used in the
vocabulary (input_dim should equal size of vocabulary + 1).
I was reading this answer but I'm still confused. If my vocabulary size is n but they are encoded with index values from 1 to n (0 is left for padding), is input_dim equal to n or n+1?
If the inputs are padded with zeroes, what are the consequences of leaving mask_zero = False?
If mask_zero = True, based on the documentation, I would have to increment the answer from my first question by one? What is the expected behaviour if this was not done?

I am basically just trying to rephrase parts of the linked answer to make it a bit more understandable in the current context, and also address your other subquestions (which technically should be their own questions, according to [ask]).
It does not matter whether you actually use 0 for padding or not, Keras assumes that you will start indexing from zero and will have to "brace itself" for an input value of 0 in your data. Therefore, you need to choose the value as n+1, because you are essentially just adding a specific value to your vocabulary that you previously didn't consider.
I think this is out of scope for this question to discuss in detail, but - depending on the exact model - the loss values on padded positions do not affect the backpropagation. However, if you choose mask_zero = False, your model will essentially have to correctly predict padding on all those positions (where the padding then also affects the training).
This relates to my illustration: Essentially, you are adding a new vocabulary index. If you do not adjust your dimension, there will likely be an indexing error (out of range) for the vocabulary entry with the highest index (n). Otherwise, you would likely not notice any different behavior.

Related

Dynamic RNN: padding word vector

I am a bit confused on padding, my first question is:
Is it possible to pad a shorter sequence with values that are not 0? How do you deal with that then in the RNN?
Generally a 0 is used for padding, is there a specific reason why? Does it make it easy in the training because it does not affect the calculation or you still need to mask the loss function?
In case your sentence is composed of vectors embedding from a word2vec model, would padding be applied as a zero vector?
Thanks in advance fir any hint!
Your question is addressed in How to overcome training example's different lengths when working with Word Embeddings (word2vec).
For details on the alternating min/max padding method, see Apply word embeddings to entire document, to get a feature vector.
See also: keras.preprocessing.sequence.pad_sequences, which can take a value to pad with as an argument.

Tensorflow boolean_mask with dynamic mask

The documentation of boolean_mask says that the shape of the mask must be known statically. But if you do
mask.set_shape([None])
tf.boolean_mask(tensor, mask)
it seems to work fine. Is there any reason to not do this?
Looking at the documentation closely reveals that it concerns the dimensionality of the mask, not its whole shape:
mask: K-D boolean tensor, K <= N and K must be known statically.
Your mask now has size None, meaning its static shape is completely unknown, including the dimension. Your options are to either to ensure that the dimensionality of the mask is statically known (e.g., make sure its produced by an operation whose output dimensions are known, or feed a placeholder with known dimensions), or to enforce information about the size that you know, but that cannot be inferred at time of the construction of the computational graph. The latter you can do by set_shape.
When you run mask.set_shape([None]), you are enforcing an assumption that the dimensionality of the mask will always be 1 (since None is in brackets), although the number of elements is unknown. If you are certain that your mask will always be 1-dimensional, this is fine to do.

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1)). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence, Otherwise, I assign the vector for incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
According to keras documentation, the padding idea seems to be the one. There is the masking parameter in the embedding layer, that will make keras skip these values instead of processing them. In theory, you don't lose that much performance. If the library is well built, the skipping is actually skipping extra processing.
You just need to take care not to attribute the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will tranform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words in vectors because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, there is no logical relation between indices and words
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible that an embedding be capable of detecting words that are verbs, nouns, feminine, masculine, etc, everything encoded in a combination of numeric values (presence/abscence/intensity of features).
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape

Variable size multi-label candidate sampling in tensorflow?

nce_loss() asks for a static int value for num_true. That works well for problems where we have the same amount of labels per training example and we know it in advance.
When labels have a variable shape [None], and being batched and/or bucketed by bucket size with .padded_batch() + .group_by_window() it is necessary to provide a variable size num_true in order to accustom for all training examples. This is currently unsupported to my knowledge (correct me if I'm wrong).
In other words suppose we have either a dataset of images with an arbitrary amount of labels per each image (dog, cat, duck, etc.) or a text dataset with numerous multiple classes per sentence (class_1, class_2, ..., class_n). Classes are NOT mutually exclusive, and can vary in size between examples.
But as the amount of possible labels can be huge 10k-100k is there a way to do a sampling loss to improve performance (in comparison with a sigmoid_cross_entropy)?
Is there a proper way to do this or any other workarounds?
nce_loss = tf.nn.nce_loss(
weights=nce_weights,
biases=nce_biases,
labels=labels,
inputs=inputs,
num_sampled=num_sampled,
# Something like this:
# `num_true=(tf.shape(labels)[-1])` instead of `num_true=const_int`
# , would be preferable here
num_classes=self.num_classes)
I see two issues:
1) Work with NCE with different numbers of true values;
2) Classes that are NOT mutually exclusive.
To the first issue, as #michal said, there is an expectative of including this functionality in the future. I have tried almost the same thing: to use labels with shape=(None, None), i.e., true_values dimension None. The sampled_values parameter has the same problem: true_values number must be a fixed integer number. The recomended work around is to use a class (0 is the best one) representing <PAD> and complete the number of true_values. In my case, 0 is an special token that represents <PAD>. Part of code is here:
assert len(labels) <= (window_size * 2)
zeros = ((window_size * 2) - len(labels)) * [0]
labels = labels + zeros
labels.sort()
I sorted the label because considering another recommendation:
Note: By default this uses a log-uniform (Zipfian) distribution for
sampling, so your labels must be sorted in order of decreasing
frequency to achieve good results.
In my case, the special tokens and more frequent words have lower indexes, otherwise, less frequent words have higher indexes. I included all label classes associated to the input at same time and completed with zero till the true_values number. Of course, you must ignore the 0 class at the end.

Setting up the input on an RNN in Keras

So I had a specific question with setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28. So there are 550k rows each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k- sequence length) different arrays and feed all of those to the network?
Assuming that I want to the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50,27), and again do I have to manually split the dataset up or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array. This is also called "batch size", number of examples, samples, etc.
Time steps are the amount of steps for each sequence
Elements per step is how much info you have in each step of a sequence
I'm assuming the 27 features are inputs and relate to ElementsPerStep, while the 1 target is the expected output having 1 output per step.
So I'm also assuming that your output is a sequence with also 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
#1 sequence
#550000 steps per sequence
#28 data elements per step
#PS: this sequence is too long, if it creates memory problems to you, maybe it will be a good idea to use a `stateful=True` RNN, but I'm explaining the non stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers will ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000,27).
Since your desired result is a sequence with same length, we will use return_sequences=True. (Else, you'd get only one result).
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells)
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output. (If you're using only recurrent layers, of course)
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train in smaller length sequences*. The system will understand that every new batch is a sequel of the previous batches.
Then you will need to define batch_input_shape=(BatchSize,ReducedTimeSteps,Elements). In this case, the batch size should not be ignored like in the other case.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps:
X = []
for i in range(550000-49):
X.append(originalX[i:i+50]) #then take care of the 28th element
Y = #it seems you just exclude the first 49 ones from the original