This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings, characters)), that is, pad the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings + characters, 1), that is, concatenate the inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence; otherwise, I assign the vector for an incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
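As a rough sketch of the x, e and y construction described above, here is a plain-Python version. The alphabet, the toy word vectors, the INC values and the 4-dimensional embedding size are all made up for illustration (the question uses 200 dimensions). Concatenating x and e along the feature axis at every time step is one possible way to feed both into a single recurrent input, not necessarily what the paper does.

```python
# Hypothetical 38-symbol alphabet; the question only says the index has size 38.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 -"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

EMB_DIM = 4                      # 200 in the question, kept small here
INC = [0.5] * EMB_DIM            # placeholder vector for "incomplete word"
WORD_VECTORS = {                 # toy stand-in for precomputed embeddings
    "red": [0.1, 0.2, 0.3, 0.4],
    "car": [0.9, 0.8, 0.7, 0.6],
}


def one_hot(ch):
    """One-hot encode a single character over ALPHABET."""
    vec = [0.0] * len(ALPHABET)
    vec[CHAR_INDEX[ch]] = 1.0
    return vec


def build_steps(text):
    """Return one (features, label) pair per character that has a successor."""
    steps = []
    word = ""
    for pos in range(len(text) - 1):
        ch = text[pos]
        if ch == " ":
            # A space closes the previous word: use its embedding if known.
            e = WORD_VECTORS.get(word, INC)
            word = ""
        else:
            e = INC
            word += ch
        x = one_hot(ch)
        y = one_hot(text[pos + 1])   # label: the next character
        steps.append((x + e, y))     # concatenate along the feature axis
    return steps


steps = build_steps("red car")
```

Each time step then has 38 + EMB_DIM features, so a batch would have shape (instances, timesteps, 38 + EMB_DIM).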
According to the Keras documentation, padding seems to be the way to go. The Embedding layer has a mask_zero parameter that makes Keras skip the padded values instead of processing them. In theory, you don't lose much performance: if the library is well built, skipping a masked step really does skip the extra processing.
You just need to take care not to attribute the value zero to any other character, not even spaces or unknown words.
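To make the "keep zero free" rule concrete, here is a small plain-Python sketch (not Keras code). The helper names build_vocab, pad and mask are hypothetical; the only point is that real tokens get ids starting at 1, so 0 can only ever mean padding.

```python
PAD = 0  # reserved id: never assigned to a real token


def build_vocab(tokens):
    """Assign ids starting at 1 so that 0 stays free for padding."""
    return {tok: i for i, tok in enumerate(sorted(set(tokens)), start=1)}


def pad(seqs, pad_value=PAD):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_value] * (max_len - len(s)) for s in seqs]


def mask(padded):
    """True where a position holds a real token, False where it is padding."""
    return [[tok != PAD for tok in row] for row in padded]


vocab = build_vocab(list("ab "))      # the space gets its own non-zero id
padded = pad([[2, 3], [2, 1, 3]])
real_positions = mask(padded)
```

A masking-aware layer would then ignore exactly the positions where the mask is False.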
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will transform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, since there is no logical relation between indices and words.
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. An embedding may be capable of detecting words that are verbs, nouns, feminine, masculine, etc., everything encoded in a combination of numeric values (presence/absence/intensity of features).
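Conceptually, the lookup described above can be pictured as indexing into a table with one row per dictionary entry. The sketch below is plain Python rather than the Keras layer itself. The table values are random (untrained), the size 3 is arbitrary, and keeping row 0 at zeros for padding is an illustrative choice.

```python
import random

random.seed(0)

VOCAB = {"hey": 1, ",": 2, "I'm": 3, "here": 4, "not": 5}
EMB_DIM = 3  # arbitrary size chosen for the example

# One row per index; row 0 is the padding row. In a real model the other rows
# would be trainable weights, here they are just random numbers.
table = [[0.0] * EMB_DIM] + [
    [random.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in VOCAB
]


def embed(ids):
    """Replace every integer id by its row in the table."""
    return [table[i] for i in ids]


vectors = embed([1, 2, 3, 4, 0])      # "hey, I'm here" plus one padding step
repeated = embed([1, 2, 1, 2, 1])     # "hey, hey, hey"
```

Note that the same word always maps to the same row, whatever its position in the sentence.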
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape
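The batching-by-length idea can be sketched as follows. This is a minimal plain-Python illustration, and batches_by_length is a hypothetical helper rather than a Keras function: sequences are grouped so that every batch contains only one length, which means no padding is needed inside a batch.

```python
from collections import defaultdict


def batches_by_length(seqs, batch_size):
    """Group sequences by length, then cut each group into batches."""
    buckets = defaultdict(list)
    for s in seqs:
        buckets[len(s)].append(s)

    batches = []
    for length in sorted(buckets):
        group = buckets[length]
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])
    return batches


seqs = [[1, 2], [3], [4, 5], [6, 7, 8], [9, 1]]
batches = batches_by_length(seqs, batch_size=2)
```

Each batch can then be fed to the model on its own, at the cost of some batches being smaller than batch_size.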
Related
I want to recommend an item complementary to a cart of items. So, naturally, I thought of using embeddings to represent items, and I came up to a layer of this kind in keras:
item_input = Input(shape=(MAX_CART_SIZE,), name="item_id")
item_embedding = Embedding(input_dim=NB_ITEMS+1, input_length=MAX_CART_SIZE, output_dim=EMBEDDING_SIZE, mask_zero=True)
I used masking to handle the variable size of the carts. So, the dimensions of the output tensor of this layer are MAX_CART_SIZE x EMBEDDING_SIZE. It means that there are as many different embeddings as there are potential items. In other words, an item can be encoded in a different way according to its position within the cart, and that's undesirable behavior... Though, it seems that most neural networks dealing with NLP data work this way, with embeddings not associated with words but with words/indices within a phrase.
So, what would be the correct way to preserve order invariance? In other words, I'd like the cart A,B,C to be strictly equivalent to the carts C,B,A or B,A,C in terms of input representation and generated output.
One way to get invariance is to use a Transformer architecture WITHOUT positional embeddings. Each item is then encoded to an embedding, and because there is no positional embedding, the item embedding is the same whether the item is in the first position or the last one.
Moreover, without positional embeddings the Transformer layers themselves do not depend on item order: permuting the inputs only permutes the outputs in the same way, so pooling the outputs (for example by summing or averaging) gives an order-invariant representation of the cart.
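As a much simpler stand-in for the Transformer (this is not a Transformer, just a position-free embedding lookup followed by mean pooling), the sketch below shows the invariance property in plain Python. The item vectors are made-up values and id 0 is assumed to be the padding id.

```python
# Hypothetical item embeddings; id 0 is reserved for padding and has no vector.
ITEM_VECTORS = {
    1: [1.0, 0.0],
    2: [0.5, -2.0],
    3: [-1.5, 4.0],
}
DIM = 2


def encode_cart(item_ids):
    """Average the embeddings of the real items, ignoring padding (id 0)."""
    items = [i for i in item_ids if i != 0]
    if not items:
        return [0.0] * DIM
    total = [0.0] * DIM
    for i in items:
        for d in range(DIM):
            total[d] += ITEM_VECTORS[i][d]
    return [t / len(items) for t in total]


cart_abc = encode_cart([1, 2, 3, 0])
cart_cba = encode_cart([3, 2, 1, 0])
cart_bac = encode_cart([2, 0, 1, 3])
```

Because each item's vector does not depend on its position and the pooling is symmetric, all three carts get the same representation.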
I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers which represent indices. This does not ensure unicity, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over Tokenizer?
Update: I got to know that hashing is used in one_hot to convert words into numbers, but I didn't get its importance, as we can use the Tokenizer class to do the same thing with more accuracy.
Not sure what you mean by unicity. I expect it has to do with the sequential relationship between the words. That, of course, is lost with one-hot encoding. However, one-hot encoding is used when the number of words is limited. If, say, you have 10 words in the vocabulary, you will create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still create a single-input model. However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot to process. So in the case of a large vocabulary it is best to use a "dense" encoding rather than the sparse encoding generated by one-hot encoding. You can use the results of the tokenizer encoding as input to a Keras embedding layer, which will encode the words into an n-dimensional space, where n is a value you specify. If you have additional ordinal features, then your model will need multiple inputs to process the data. Perhaps that is why some people prefer to one-hot encode the words.
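The unicity concern from the question can be illustrated with a small plain-Python sketch. It does not call Keras. hashed_index only imitates the general idea of hashing a word into a fixed range (the exact formula and the use of MD5 are assumptions made for the example), and fit_vocab imitates a tokenizer that hands out a fresh index to each new word.

```python
import hashlib


def hashed_index(word, n):
    """Hash a word into the range [1, n - 1]; 0 is left unused."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1


def fit_vocab(words):
    """Give every distinct word its own index, starting at 1."""
    vocab = {}
    for w in words:
        if w not in vocab:
            vocab[w] = len(vocab) + 1
    return vocab


words = ["red", "car", "blue"]

# With n = 3 only two indices are available, so three distinct words
# must collide somewhere, whatever the hash function is.
hashed = [hashed_index(w, 3) for w in words]

# A fitted vocabulary never collides, but it has to see the data first.
vocab = fit_vocab(words)
```

Hashing needs no fitting step and no stored vocabulary, which may be part of its appeal, but collisions become more likely as n gets small relative to the number of distinct words.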
I have been using keras.layers.Embedding for almost all of my projects. But, recently I wanted to fiddle around with tf.data and found feature_column.embedding_column.
From the documentation:
feature_column.embedding_column - DenseColumn that converts from sparse, categorical input. Use this when your inputs are sparse, but you want to convert them to a dense representation (e.g., to feed to a DNN).
keras.layers.Embedding - Turns positive integers (indexes) into dense vectors of fixed size.
e.g. [[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
This layer can only be used as the first layer in a model.
My question is: are both APIs doing a similar thing on different types of input data (for example, input [0,1,2] for keras.layers.Embedding and its one-hot-encoded representation [[1,0,0],[0,1,0],[0,0,1]] for feature_column.embedding_column)?
After reviewing source code for both operations here is what I found:
both operations rely on tensorflow.python.ops.embedding_ops functionality;
keras.layers.Embedding uses dense representations and contains generic Keras code for fiddling with shapes, initializing variables, etc.;
feature_column.embedding_column relies on sparse representations and contains functionality to cache results.
So, your guess seems to be right: these two do similar things, rely on distinct input representations, and contain some extra logic that doesn't change the essence of what they do.
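The underlying equivalence can be checked by hand: looking up row i of a weight matrix gives the same vector as multiplying the one-hot encoding of i by that matrix. The plain-Python sketch below shows only this arithmetic with made-up weights. It is not how either API is implemented internally.

```python
# Made-up embedding matrix: 3 categories, 2 dimensions each.
W = [
    [0.25, 0.1],
    [0.6, -0.2],
    [0.0, 0.9],
]


def lookup(idx):
    """Dense-index view: pick row idx directly."""
    return W[idx]


def one_hot_matmul(idx):
    """One-hot view: multiply a one-hot row vector by W."""
    one_hot = [1.0 if j == idx else 0.0 for j in range(len(W))]
    return [
        sum(one_hot[j] * W[j][d] for j in range(len(W)))
        for d in range(len(W[0]))
    ]


results = [(lookup(i), one_hot_matmul(i)) for i in range(len(W))]
```

The lookup avoids materializing the one-hot vectors, which matters when the number of categories is large.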
I am a bit confused about padding. My first question is:
Is it possible to pad a shorter sequence with values that are not 0? How do you deal with that then in the RNN?
Generally a 0 is used for padding, is there a specific reason why? Does it make it easy in the training because it does not affect the calculation or you still need to mask the loss function?
In case your sentence is composed of vectors embedding from a word2vec model, would padding be applied as a zero vector?
Thanks in advance for any hint!
Your question is addressed in How to overcome training example's different lengths when working with Word Embeddings (word2vec).
For details on the alternating min/max padding method, see Apply word embeddings to entire document, to get a feature vector.
See also: keras.preprocessing.sequence.pad_sequences, which can take a value to pad with as an argument.
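To illustrate the questions about a non-zero padding value and about masking the loss, here is a plain-Python sketch. pad_post and masked_mean are hypothetical helpers, not the Keras functions. pad_post pads at the end of each sequence with whatever value is passed, and masked_mean averages per-step losses while ignoring the padded steps.

```python
def pad_post(seqs, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [value] * (max_len - len(s)) for s in seqs]


def masked_mean(losses, tokens, pad_value):
    """Average the per-step losses over the non-padding steps only."""
    kept = [loss for loss, tok in zip(losses, tokens) if tok != pad_value]
    return sum(kept) / len(kept) if kept else 0.0


PAD = -1  # any value works as long as no real token uses it
padded = pad_post([[5, 6], [7]], value=PAD)

# The loss on the padded step (100.0) must not influence the result.
loss = masked_mean([1.0, 3.0, 100.0], [4, 2, PAD], PAD)
```

The padding value itself is arbitrary; what matters is that it is unambiguous and that padded steps are excluded wherever they could affect the result.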
I'm trying to implement the paper "End-to-End memory networks" (http://arxiv.org/abs/1503.08895)
Each training example consists of a number of phrases, a question and then the answer. The number of sentences is variable, as is the number of words in each sentence and the question. Each word is encoded as an integer. So my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimensions are unknown for each mini-batch. Can I still somehow represent this input as a single tensor, or do I have to use lists of tensors, so that I have a list of length batch_size, then a sublist of length number of sentences, and then for each sentence a tensor, whose size is also not known in advance, corresponding to the words encoded as integers?
Can I use this second approach, or will TensorFlow then not be able to backpropagate? For example, I have an operation where I have to calculate the following sum: \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are then stored in a list, so I would have to sum over the elements in the two lists in a loop. I'm assuming that TensorFlow would not be able to incorporate this loop in the computation graph, correct? I'm sceptical since Theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does TensorFlow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.
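For the sum \sum_i p_i c_i from the question, the arithmetic itself does not require a list of variable length: if the memories are padded to a common size and the padded positions get weight 0, the result is unchanged. The plain-Python sketch below shows only that arithmetic with made-up numbers. It is not TensorFlow code, and weighted_sum is a hypothetical helper.

```python
def weighted_sum(p, c):
    """Compute sum_i p[i] * c[i], where each c[i] is a vector."""
    dim = len(c[0])
    out = [0.0] * dim
    for p_i, c_i in zip(p, c):
        for d in range(dim):
            out[d] += p_i * c_i[d]
    return out


# Two real memory slots.
p = [0.5, 0.5]
c = [[2.0, 0.0], [0.0, 4.0]]
exact = weighted_sum(p, c)

# Same example padded to three slots: the padded slot has weight 0,
# so whatever vector sits there does not contribute.
p_padded = [0.5, 0.5, 0.0]
c_padded = [[2.0, 0.0], [0.0, 4.0], [7.0, 7.0]]
padded = weighted_sum(p_padded, c_padded)
```

With fixed-size padded inputs, the same sum can be written as an element-wise multiplication followed by a reduction over the slot axis, which avoids an explicit Python loop.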