Tensorflow embeddings - tensorflow

I know what embeddings are and how they are trained. Precisely, while referring to the tensorflow's documentation, I came across two different articles. I wish to know what exactly is the difference between them.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable(“word_embeddings”,
[vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings sections. My doubt is: does the gather function train the embeddings automatically? I am not sure since this op ran very fast on my pc.
Generally: What is the right way to convert words into vectors (link1 or link2) in tensorflow for training a seq2seq model? Also, how to train the embeddings for a seq2seq dataset, since the data is in the form of separate sequences for my task unlike (a continuous sequence of words refer: link 1 dataset)

Alright! anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq models or Any other encoder-decoder models, we use the second approach where the embedding matrix gets tuned appropriately while the model gets trained.

Related

Trainable USE-lite-based classifier with SentencePiece input

I have heard that it is possible to use the pretrained Universal Sentence Encoder (USE) (neural language model) from TF-hub as part of a trainable model, e.g. a sentence classifier. Some versions of USE rely on SentencePiece sub-word tokenizer, which I also need. There are minimal instructions online for how to do this.
Here is how to use USE-lite with SentencePiece:
- https://tfhub.dev/google/universal-sentence-encoder-lite/2
Here is how to train a classifier based on a pretrained USE model:
- http://hunterheidenreich.com/blog/google-universal-sentence-encoder-in-keras/
- https://www.youtube.com/watch?v=gnz1CUzb5qo
And here is how to measure sentence similarity using both USE-lite and SentencePiece:
- https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder_lite.ipynb
I have successfully reproduced the above pieces separately. I have then tried to combine the above ideas into a single POC that will build a classifier model based on USE-lite and SentencePiece, but I cannot see how to do it. I am currently stuck on the part where I modify the trainable classifier's first layer(s). I have tried to make it accept either (1) SentencePiece token IDs (in which I tokenize the text outide of the Tensorflow graph) or (2) raw text (using SentencePiece as an Op inside the Tensorflow graph). After that point, it should feed tokenized text forward into the USE-lite model, either in a lambda or in some other way. Finally, the output of USE-lite should be fed into a dense layer (or two?) ending in softmax for computing class probabilities.
I am relatively new to Tensorflow. I imagine that the above sources would be sufficient for a more experienced Tensorflow developer to merge and make work for my use-case. Let me know if you can provide any pointers. Thanks.

Questions about tensorflow GetStarted tutorial

So I was reading the tensorflow getstarted tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why they are necesary (or not).
In the tf.estimator section, what's the meaning or what are they supposed to be the "x_eval" and "y_eval" arrays? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necesary? How did they choose those values?
The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is one has
num_epochs=None, shuffle=True
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, or what's the difference between the two, or if both are necesary.
3.In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is, how do i get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitelly found that W=-1 and b=1. If I was doing a more complex neural network, involving biases and weights, I think I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using tensorflow, to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getStarted" tutorial. I personally think it leaves a lot to desire, since it's very unclear what each thing does and you can at best guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimizer the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = - x_eval + 1
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple times on the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.

Weights update in Tensorflow embedding layer with pretrained fasttext weights

I'm not sure if my understanding is correct but...
While training a seq2seq model, one of the purpose I want to initiated a set of pre-trained fasttext weights in the embedding layers is to decrease the unknown words in the test environment (these unknown words are not in training set). Since pre-trained fasttext model has larger vocabulary, during test environment, the unknown word can be represented by fasttext out-of-vocabulary word vectors, which supposed to have similar direction of the semantic similar words in the training set.
However, due to the fact that the initial fasttext weights in the embedding layers will be updated through the training process (updating weights generates better results). I am wondering if the updated embedding weights would distort the relationship of semantic similarity between words and undermine the representation of fasttext out-of-vocabulary word vectors? (and, between those updated embedding weights and word vectors in the initial embedding layers but their corresponding ID didn't appear in the training data)
If the input ID can be distributed represented vectors extracted from pre-trained model and, then, map these pre-trained word vectors (fixed weights while training) via a lookup table to the embedding layers (these weights will be updated while training), would it be a better solution?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vector and fine-tuning them in your final model, the words that are infrequent or hasn't appear in your training set won't get any updates.
Now, usually one can test how much of the issue for your particular case this is. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what's the difference in model performance on validation set.
If you see a big difference in performance on validation set when you are not fine-tuning, here is a few ways to handle this:
a) Add a linear transformation layer after not-trainable embeddings. Fine-tuning embeddings in many cases does affine transformations to the space, so one can capture this in a separate layer that can be applied at test time.
E.g. A is pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_vairable("b", [embed_size])
embeds = tf.mul(embeds, X) + b
b) Keep pre-trained embeddings in the not-trainable embedding matrix A. Add trainable embedding matrix B, that has a smaller vocab of popular words in your training set and embedding size. Lookup words both in A and B (and if word is out of vocab use ID=0 for example), concat results and use it input to your model. This way you will teach your model to use mostly A and sometimes rely on B for popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
B = tf.get_variable("B", [smaller_vocab_size, embed_size])
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens, tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)

Tensorflow: jointly training CNN + LSTM

There are quite a few examples on how to use LSTMs alone in TF, but I couldn't find any good examples on how to train CNN + LSTM jointly.
From what I see, it is not quite straightforward how to do such training, and I can think of a few options here:
First, I believe the simplest solution (or the most primitive one) would be to train CNN independently to learn features and then to train LSTM on CNN features without updating the CNN part, since one would probably have to extract and save these features in numpy and then feed them to LSTM in TF. But in that scenario, one would probably have to use a differently labeled dataset for pretraining of CNN, which eliminates the advantage of end to end training, i.e. learning of features for final objective targeted by LSTM (besides the fact that one has to have these additional labels in the first place).
Second option would be to concatenate all time slices in the batch
dimension (4-d Tensor), feed it to CNN then somehow repack those
features to 5-d Tensor again needed for training LSTM and then apply a cost function. My main concern, is if it is possible to do such thing. Also, handling variable length sequences becomes a little bit tricky. For example, in prediction scenario you would only feed single frame at the time. Thus, I would be really happy to see some examples if that is the right way of doing joint training. Besides that, this solution looks more like a hack, thus, if there is a better way to do so, it would be great if someone could share it.
Thank you in advance !
For joint training, you can consider using tf.map_fn as described in the documentation https://www.tensorflow.org/api_docs/python/tf/map_fn.
Lets assume that the CNN is built along similar lines as described here https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py.
def joint_inference(sequence):
inference_fn = lambda image: inference(image)
logit_sequence = tf.map_fn(inference_fn, sequence, dtype=tf.float32, swap_memory=True)
lstm_cell = tf.contrib.rnn.LSTMCell(128)
output_state, intermediate_state = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=logit_sequence)
projection_function = lambda state: tf.contrib.layers.linear(state, num_outputs=num_classes, activation_fn=tf.nn.sigmoid)
projection_logits = tf.map_fn(projection_function, output_state)
return projection_logits
Warning: You might have to look into device placement as described here https://www.tensorflow.org/tutorials/using_gpu if your model is larger than the memory gpu can allocate.
An Alternative would be to flatten the video batch to create an image batch, do a forward pass from CNN and reshape the features for LSTM.

Building a conversational model using TensorFlow

I'd like to build a conversational modal that can predict a sentence using the previous sentences using TensorFlow LSTMs . The example provided in TensorFlow tutorial can be used to predict the next word in a sentence .
https://www.tensorflow.org/versions/v0.6.0/tutorials/recurrent/index.html
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
# The value of state is updated after processing each batch of words.
output, state = lstm(current_batch_of_words, state)
# The LSTM output can be used to make next word predictions
logits = tf.matmul(output, softmax_w) + softmax_b
probabilities = tf.nn.softmax(logits)
loss += loss_function(probabilities, target_words)
Can I use the same technique to predict the next sentence ? Is there any working example on how to do this ?
You want to use the Sequence-to-sequence model. Instead of having it learn to translate sentences from a source language to a target language you have it learn responses to previous utterances in the conversation.
You can adapt the example seq2seq model in tensorflow by using the analogy that the source language 'English' is your set of previous sentences and target language 'French' are your response sentences.
In theory you could use the basic LSTM you were looking at by concatenating your training examples with a special symbol like this:
hello there ! __RESPONSE hi , how can i help ?
Then during testing you run it forward with a sequence up to and including the __RESPONSE symbol and the LSTM can carry it the rest of the way.
However, the seq2seq model above should be much more accurate and powerful because it had a separate encoder / decoder and includes an attention mechanism.
A sentence is composed words, so you can indeed predict the next sentence by predicting words sequentially. There are models, such as the one described in this paper, that build embeddings for entire paragraphs, which can be useful for your purpose. Of course there is Neural Conversational Model work that probably directly fits your need. TensorFlow doesn't ship with working examples of these models, but the recurrent models that come with TensorFlow should give you a good starting point for implementing them.