I'm not sure if my understanding is correct but...
While training a seq2seq model, one of the purpose I want to initiated a set of pre-trained fasttext weights in the embedding layers is to decrease the unknown words in the test environment (these unknown words are not in training set). Since pre-trained fasttext model has larger vocabulary, during test environment, the unknown word can be represented by fasttext out-of-vocabulary word vectors, which supposed to have similar direction of the semantic similar words in the training set.
However, due to the fact that the initial fasttext weights in the embedding layers will be updated through the training process (updating weights generates better results). I am wondering if the updated embedding weights would distort the relationship of semantic similarity between words and undermine the representation of fasttext out-of-vocabulary word vectors? (and, between those updated embedding weights and word vectors in the initial embedding layers but their corresponding ID didn't appear in the training data)
If the input ID can be distributed represented vectors extracted from pre-trained model and, then, map these pre-trained word vectors (fixed weights while training) via a lookup table to the embedding layers (these weights will be updated while training), would it be a better solution?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vector and fine-tuning them in your final model, the words that are infrequent or hasn't appear in your training set won't get any updates.
Now, usually one can test how much of the issue for your particular case this is. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what's the difference in model performance on validation set.
If you see a big difference in performance on validation set when you are not fine-tuning, here is a few ways to handle this:
a) Add a linear transformation layer after not-trainable embeddings. Fine-tuning embeddings in many cases does affine transformations to the space, so one can capture this in a separate layer that can be applied at test time.
E.g. A is pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_vairable("b", [embed_size])
embeds = tf.mul(embeds, X) + b
b) Keep pre-trained embeddings in the not-trainable embedding matrix A. Add trainable embedding matrix B, that has a smaller vocab of popular words in your training set and embedding size. Lookup words both in A and B (and if word is out of vocab use ID=0 for example), concat results and use it input to your model. This way you will teach your model to use mostly A and sometimes rely on B for popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
B = tf.get_variable("B", [smaller_vocab_size, embed_size])
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens, tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)
Related
Tensoflow Embedding Layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) is easy to use,
and there are massive articles talking about
"how to use" Embedding (https://machinelearningmastery.com/what-are-word-embeddings/, https://www.sciencedirect.com/topics/computer-science/embedding-method)
.
However, I want to know the Implemention of the very "Embedding Layer" in Tensorflow or Pytorch.
Is it a word2vec?
Is it a Cbow?
Is it a special Dense Layer?
Structure wise, both Dense layer and Embedding layer are hidden layers with neurons in it. The difference is in the way they operate on the given inputs and weight matrix.
A Dense layer performs operations on the weight matrix given to it by multiplying inputs to it ,adding biases to it and applying activation function to it. Whereas Embedding layer uses the weight matrix as a look-up dictionary.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
Here 1000 means the number of words in the dictionary and 64 means the dimensions of those words. Intuitively, embedding layer just like any other layer will try to find vector (real numbers) of 64 dimensions [ n1, n2, ..., n64] for any word. This vector will represent the semantic meaning of that particular word. It will learn this vector while training using backpropagation just like any other layer.
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure—a kind of structure specialized for the specific problem for which you’re training your model.
-- Deep Learning with Python by F. Chollet
Edit - How "Backpropagation" is used to train the look-up matrix of the Embedding Layer ?
Embedding layer is similar to the linear layer without any activation function. Theoretically, Embedding layer also performs matrix multiplication but doesn't add any non-linearity to it by using any kind of activation function. So backpropagation in the Embedding layer is similar to as of any linear layer. But practically, we don't do any matrix multiplication in the embedding layer because the inputs are generally one hot encoded and the matrix multiplication of weights by a one-hot encoded vector is as easy as a look-up.
Based on text description (varing length of 1000 words), i have to tag into one or more possible label values (for single training sample).
Any suggesting algorithm to proceed further.
More concered in backpropogation in tensorflow (Loss, optimizer & accuracy).
Would be using glove vector (Make use of Pretrained from stanfold) as embedding layer.
I know what embeddings are and how they are trained. Precisely, while referring to the tensorflow's documentation, I came across two different articles. I wish to know what exactly is the difference between them.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable(“word_embeddings”,
[vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings sections. My doubt is: does the gather function train the embeddings automatically? I am not sure since this op ran very fast on my pc.
Generally: What is the right way to convert words into vectors (link1 or link2) in tensorflow for training a seq2seq model? Also, how to train the embeddings for a seq2seq dataset, since the data is in the form of separate sequences for my task unlike (a continuous sequence of words refer: link 1 dataset)
Alright! anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq models or Any other encoder-decoder models, we use the second approach where the embedding matrix gets tuned appropriately while the model gets trained.
I am still relatively new to the world of Deep Learning. I wanted to create a Deep Learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the image. I was wondering if there is some example of how I can do this in deep learning. I looked up but couldn't find one single code piece that handles this case.
The way of doing this in Keras is with the KerasRegressor wrapper module (they wrap sci-kit learn's regressor interface). Useful information can also be found in the source code of that module. Basically you first have to define your Network Model, for example:
def simple_model():
#Input layer
data_in = Input(shape=(13,))
#First layer, fully connected, ReLU activation
layer_1 = Dense(13,activation='relu',kernel_initializer='normal')(data_in)
#second layer...etc
layer_2 = Dense(6,activation='relu',kernel_initializer='normal')(layer_1)
#Output, single node without activation
data_out = Dense(1, kernel_initializer='normal')(layer_2)
#Save and Compile model
model = Model(inputs=data_in, outputs=data_out)
#you may choose any loss or optimizer function, be careful which you chose
model.compile(loss='mean_squared_error', optimizer='adam')
return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor
#chose your epochs and batches
regressor = KerasRegressor(build_fn=simple_model, nb_epoch=100, batch_size=64)
#fit with your data
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to detect anomalous images from the ones that are ok, one approach you can take is to train your regressor by passing anomalous images labeled 1 and images that are ok labeled 0.
This will make your model to return a value closer to 1 when the input is an anomalous image, enabling you to threshold the desired results. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
Also, as you mentioned, Autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras Blog post Building Autoencoders in Keras, where they explain in detail about the implementation of them with the Keras library.
It is worth noticing that Single-class classification is another way of saying Regression.
Classification tries to find a probability distribution among the N possible classes, and you usually pick the most probable class as the output (that is why most Classification Networks use Sigmoid activation on their output labels, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, Regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or Coefficient of Determination). Its output is a real number/continuous (and the reason why most Regression Networks don't use activations on their outputs). I hope this helps, good luck with your coding.
I'm very interested in GAN those times.
I coded one for MNIST with the following structure :
Generator model
Discriminator model
Gen + Dis model
Generator model generate batches of image from random distribution.
Discrimator is trained over it and real images.
Then Discriminator is freeze in Gen+Dis model and Generator trained. (With the frozen Discriminator who says if the generator is good or not)
Now, imagine I don't want to feed my generator with a random distribution but with images. (For upscaling for example, or generate an real image from a draw)
Do I need to change something in it ?
(Except the conv model who will be more complex)
Should I continue to use the binary_crossentropy as loss function ?
Thanks you very much!
You can indeed put a variational autoencoder (VAE) in front in order to generate the initial distribution z (see paper).
If you are interested in the topic I can recommend the this course at Kadenze.