How to use fasttext vectors in a tensorflow embedding layer? - tensorflow

I just struggle to find out how I can use Fasttext wordvectors for OOV words in a keras/tensorflow embedding layer. There is nothing out there. Maybe someone has thought of that too and has some hints for me?
The way via word embedding look up works via indices like
tf.nn.embedding_lookup(word_embeddings, x)
And you could have an index for one OOV. But how can I assign a specific vector (from a different and custom source like fasttext) at runtime?
I imagine a function which can customly assign a vector to the UNK index for a OOV word.
Related to that:
Assign custom word vector to UNK token during prediction?
Using subword information in OOV token from fasttext in word embedding layer (keras/tensorflow)

You can do the embedding lookup \ computation outside of tensorflow, and use the embedded text as input to the model (so the input wont be a sequence of word indices, but a sequence of vectors)

Related

How to concatenate Glove 100d embedding and 1d array which contains additional signal?

I new to NLP and trying out some text classification algorithms. I have 100d GloVe vector representing each entry as a list of embeddings. Also, I have NER feature of shape (2234,) which shows if there is named entity or not. Array with GloVe embeddings is of shape (2234, 100).
How to correctly concatenate these array so each row represents its word?
Sorry for not including reproducible example. Please, use variables of your choice to explain the concatenation procedure.
Using np.concatenate did not work as I have expected but i don't know how to deal with dimensionality of embeddings.
Just in case someone accidentally gets here. Use:
my_arr.reshape(2234,1)
Don't be me:)

Size of input and output layers in Keras implementation of an RNN Language Model

As part of my thesis, I am trying to build a recurrent Neural Network Language Model.
From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words of our Vocabulary, followed by an Embedding layer, which, in Keras, it apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should also be the size of our vocabulary so that each output value maps 1-1 to each vocabulary word.
However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason gives an explenation that this is due to the implementation of the Embedding layer in Keras but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities and I have one probability too many that I do not know to which word to map it too.
Does anyone know what is the correct way of acheiving the desired result? Did Jason just forget to subtrack one from the output layer and the Embedding layer just needs a +1 for implementation reasons (I mean it's stated in the official API)?
Any help on the subject would be appreciated (why is Keras API documentation so laconic?).
Edit:
This post Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2? made me think that Jason does in fact have it wrong and that the size of the Vocabulary should not be incremented by one when our word indices are: 0, 1, ..., n-1.
However, when using Keras's Tokenizer our word indices are: 1, 2, ..., n. In this case, the correct approach is to:
Set mask_zero=True, to treat 0 differently, as there is never a
0 (integer) index input in the Embedding layer and keep the
vocabulary size the same as the number of vocabulary words (n)?
Set mask_zero=True but augment the vocabulary size by one?
Not set mask_zero=True and keep the vocabulary size the same as the
number of vocabulary words?
the reason why we add +1 leads to the possibility that we can encounter a chance to see an unseen word(out of our vocabulary) during testing or in production, it is common to consider a generic term for those UNKNOWN and that is why we add a OOV word in front which resembles all out of vocabulary words.
Check this issue on github which explains it in detail:
https://github.com/keras-team/keras/issues/3110#issuecomment-345153450

Extract embedding of a certain token from a 3D tensor

Each training example contains a sentence (a list of tokens), a start token index and an end token index. The start and end index highlight some words in the sentence. The label is either True or False, which indicates the highlighted word is "interesting" or not.
For example, "news" is the highlighted word in this example.
tokens=["This", "news", "is", "sad"], start=1, end=2, label=1
Here is my model setup.
The sentences are feeded to the pre-trainied bert layer, which returns a 3D tensor which contain s the embedding of each tokens in every sentences.
I'd like to extract the embedding of the highligted words of each exampel from the 3D tensor and get the mean of them, so that I can feed it to a dense layer, how should I do that? Assuming I already have the 3D tensor.
I am using keras API proivded by tensorflow 2.0.
I am new to tensorflow, so a concrete example would be much appreciated.
Thanks!

How to use Transformers for text classification?

I have two questions about how to use Tensorflow implementation of the Transformers for text classifications.
First, it seems people mostly used only the encoder layer to do the text classification task. However, encoder layer generates one prediction for each input word. Based on my understanding of transformers, the input to the encoder each time is one word from the input sentence. Then, the attention weights and the output is calculated using the current input word. And we can repeat this process for all of the words in the input sentence. As a result we'll end up with pairs of (attention weights, outputs) for each word in the input sentence. Is that correct? Then how would you use this pairs to perform a text classification?
Second, based on the Tensorflow implementation of transformer here, they embed the whole input sentence to one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences based on what I've learned from The Illustrated Transformer
Thank you!
There are two approaches, you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. When pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In the downstream tasks, it is also used for sentence classification. However, my experience is that sometimes, averaging the hidden states give a better result.
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually finetune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
The Transformers are designed to take the whole input sentence at once. The main motive for designing a transformer was to enable parallel processing of the words in the sentences. This parallel processing is not possible in LSTMs or RNNs or GRUs as they take words of the input sentence as input one by one.
So in the encoder part of the transformers, the very first layer contains the number of units equal to the number of words in a sentence and then each unit converts that word into an embedding vector corresponding to that word. Further, the rest of the processes are carried out. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this transformer for text classification - Since in text classification our output is a single number not a sequence of numbers or vectors so we can remove the decoder part and just use the encoder part. The output of the encoder is a set of vectors, the same in number as the number of words in the input sentence. Further, we can feed these sets of output vectors into a CNN, or we can add an LSTM or RNN model and perform classification.
The input is the whole sentence or batch of sentences not word by word. Surely you would have misunderstood it.

Analogies from Word2Vec in TensorFlow?

I implemented Word Embeddings in Tensor Flow similarly to the code here I was able to get the final embeddings (final_embeddings), but I would like to evaluate the embeddings using the analogies typical of this exercise. How can I identify which term corresponds to which row in the final embeddings array? Alternatively, is there an implementation in Tensor Flow for this? Any help would be greatly appreciated (specifics and resources would be a plus;) ). Thanks!
Recommend this conceptual tutorial to you.
If you are using skip-gram, the input is one-hot encoding. So the index of the 1 is the index of the vector of the word.
The implementation in tensorflow is quite simple. You may want to see this function: tf.nn.embedding_lookup
For example:
embed = tf.nn.embedding_lookup(embedding, inputs)
The embed is the vector you are looking for.
Which term corresponds to which row in the final embeddings array is completely up to your implementation. At some point before your training you converted each word to a number, right? This number indicates that row in your embeddings table.
If you want to know the specific name, you could post part of your code here.