Use FastText trained embeddings in mxnet symbol embedding layer - mxnet

How do you run FastText on a corpus and use those embeddings in an MXNet symbol embedding layer?

To do that, you first need to load the matrix that contains the FastText embeddings and then pass it as an initializer to the embedding layer:
embed_layer_3 = mx.sym.Embedding(data=input_x_3, weight=the_emb_3, input_dim=vocab_size, output_dim=embedding_dim, name='vocab_embed')
I took this example from here, where they use GloVe embeddings, but the idea is the same.
I would highly recommend using the Gluon API instead of the Symbol API. That way it will be much easier for you to use all the goodness of the GluonNLP package, which already provides pretrained FastText embeddings. See this tutorial to learn how to use FastText in GluonNLP.
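For the Symbol API route, here is a rough sketch (not from the linked example) of loading FastText vectors from a .vec file and initializing the embedding weight with them; the file path, the vocab dictionary, and the shape variables are assumptions:
import numpy as np
import mxnet as mx

# Assumed to exist: vocab (dict token -> index), vocab_size, embedding_dim,
# batch_size, seq_len, and a FastText .vec file produced by `fasttext skipgram ...`.
embedding_matrix = np.zeros((vocab_size, embedding_dim), dtype=np.float32)
with open('model.vec', encoding='utf-8') as f:
    next(f)  # skip the "<num_words> <dim>" header line
    for line in f:
        parts = line.rstrip().split(' ')
        if parts[0] in vocab:
            embedding_matrix[vocab[parts[0]]] = np.asarray(parts[1:], dtype=np.float32)

input_x_3 = mx.sym.Variable('data')
the_emb_3 = mx.sym.Variable('vocab_embed_weight')
embed_layer_3 = mx.sym.Embedding(data=input_x_3, weight=the_emb_3,
                                 input_dim=vocab_size, output_dim=embedding_dim,
                                 name='vocab_embed')

# Bind a module and feed the pretrained matrix in as the initial weight values.
mod = mx.mod.Module(symbol=embed_layer_3, data_names=['data'], label_names=None)
mod.bind(data_shapes=[('data', (batch_size, seq_len))])
mod.init_params(arg_params={'vocab_embed_weight': mx.nd.array(embedding_matrix)},
                allow_missing=True)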

Related

Can I feed categorical data in Keras embedding layer without encoding the data?

I am trying to feed multi-column categorical data into a Keras embedding layer. Can I feed categorical data into a Keras embedding layer without encoding it?
If not, which encoding method is preferable for retrieving contextual information from the categorical data?
No, you cannot feed categorical data into a Keras embedding layer without encoding the data.
There are a couple of ways to encode the data:
Integer Encoding: Where each unique label is mapped to an integer.
One Hot Encoding: Where each label is mapped to a binary vector.
Learned Embedding: Where a distributed representation of the categories is learned.
The preferred method for retrieving contextual information from categorical data is the learned embedding approach.
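As a rough illustration of a learned embedding for a single categorical column (assuming TF 2.4+ for tf.keras.layers.StringLookup; the column values and sizes here are made up):
import numpy as np
import tensorflow as tf

# One categorical column; repeat the lookup + Embedding pair per column and
# concatenate the results for multi-column data.
colors = np.array(['red', 'green', 'blue', 'green'])

lookup = tf.keras.layers.StringLookup()   # integer-encodes the categories
lookup.adapt(colors)

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
ids = lookup(inputs)
embedded = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(),
                                     output_dim=8)(ids)   # learned 8-dim vectors
x = tf.keras.layers.Flatten()(embedded)
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)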
You could use any pretrained embeddings from below:
GloVe embeddings (https://nlp.stanford.edu/projects/glove/)
Word2Vec.
ConceptNet (https://github.com/commonsense/conceptnet-numberbatch)
ELMo embeddings (https://github.com/yuanxiaosc/ELMo)
ELMo embeddings code usage example:
import tensorflow_hub as hub
import tensorflow as tf
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
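To get embeddings out of the module, a TF1-style call like the following can be used (the sentence is just a placeholder):
embeddings = elmo(["the cat sat on the mat"],
                  signature="default", as_dict=True)["elmo"]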

TF-Hub Elmo uses which word embedding to concatenate with characters in Highway layer

I understand that ELMo uses a CNN over characters for character embeddings. However, I do not understand how the character embeddings are concatenated with word embeddings in the Highway network. In the ELMo paper, most of the evaluations use GloVe word embeddings together with the CNN character embeddings, which makes sense because the word embeddings are explicitly mentioned.
But for pre-trained models like the one on TF-Hub, which word embeddings are concatenated with the character embeddings in the Highway layer?
Please help me understand if you can.
Concatenation happens inside the https://tfhub.dev/google/elmo/3 model. When using the word_emb output, one can get the embedding for each token in the input. The embedding can be used for classification or other modeling tasks, similar to BERT/transformer-based models. The model also provides direct access to some of the hidden states of the LSTM through lstm_outputs1 and lstm_outputs2.
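A TF1-style sketch of accessing those outputs (the sentence is a placeholder; the output keys are the ones documented for the module):
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=False)
outputs = elmo(["the cat sat on the mat"], signature="default", as_dict=True)

word_emb = outputs["word_emb"]      # context-independent char-CNN token embeddings
lstm1 = outputs["lstm_outputs1"]    # hidden states of the first biLSTM layer
lstm2 = outputs["lstm_outputs2"]    # hidden states of the second biLSTM layer
elmo_vecs = outputs["elmo"]         # weighted sum of the three layers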

Save trained gensim word2vec model as a tensorflow SavedModel

Do we have an option to save a trained Gensim Word2Vec model as a SavedModel using TF 2.0's tf.saved_model.save? In other words, how can I save a trained embedding vector as a SavedModel signature to work with TensorFlow 2.0? The following steps, naturally, do not work:
model = gensim.models.Word2Vec(...)
model.init_sims(..)
model.train(..)
model.save(..)
module = gensim.models.KeyedVectors.load_word2vec_format(...)
tf.saved_model.save(module, export_dir)
EDIT:
This example helped me figure out how to do it: https://keras.io/examples/nlp/pretrained_word_embeddings/
Gensim does not use TensorFlow and it has its own methods for loading and saving models.
You would need to convert the Gensim embeddings into a TensorFlow model, which only makes sense if you plan to further use your embeddings within TensorFlow and possibly fine-tune them for your task.
A Gensim Word2Vec model corresponds to two steps in TensorFlow:
Vocabulary lookup: a table that assigns indices to tokens.
Embedding lookup: a layer that picks up the actual embeddings for the indices.
Then you can save it like any other TensorFlow model.
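A rough sketch of those two steps (assuming gensim 4.x and TF 2.4+; the file names are placeholders):
import numpy as np
import tensorflow as tf
from gensim.models import KeyedVectors

kv = KeyedVectors.load("word2vec.kv")    # vectors trained with gensim
vocab = list(kv.index_to_key)            # gensim 4.x attribute; index2word in 3.x
vectors = np.vstack([kv[w] for w in vocab])

# Step 1: vocabulary lookup (index 0 is reserved for out-of-vocabulary tokens).
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)

# Step 2: embedding lookup, initialized from the gensim vectors (OOV row first).
weights = np.vstack([np.zeros((1, vectors.shape[1])), vectors])
embedding = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(),
    output_dim=vectors.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(weights),
    trainable=False)                     # set True to fine-tune

tokens = tf.keras.Input(shape=(None,), dtype=tf.string)
model = tf.keras.Model(tokens, embedding(lookup(tokens)))
tf.saved_model.save(model, "export_dir")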

Trainable USE-lite-based classifier with SentencePiece input

I have heard that it is possible to use the pretrained Universal Sentence Encoder (USE) (a neural language model) from TF-Hub as part of a trainable model, e.g. a sentence classifier. Some versions of USE rely on the SentencePiece sub-word tokenizer, which I also need. There are minimal instructions online for how to do this.
Here is how to use USE-lite with SentencePiece:
- https://tfhub.dev/google/universal-sentence-encoder-lite/2
Here is how to train a classifier based on a pretrained USE model:
- http://hunterheidenreich.com/blog/google-universal-sentence-encoder-in-keras/
- https://www.youtube.com/watch?v=gnz1CUzb5qo
And here is how to measure sentence similarity using both USE-lite and SentencePiece:
- https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder_lite.ipynb
I have successfully reproduced the above pieces separately. I have then tried to combine the above ideas into a single POC that will build a classifier model based on USE-lite and SentencePiece, but I cannot see how to do it. I am currently stuck on the part where I modify the trainable classifier's first layer(s). I have tried to make it accept either (1) SentencePiece token IDs (in which case I tokenize the text outside of the TensorFlow graph) or (2) raw text (using SentencePiece as an op inside the TensorFlow graph). After that point, it should feed the tokenized text forward into the USE-lite model, either in a lambda or in some other way. Finally, the output of USE-lite should be fed into a dense layer (or two?) ending in a softmax for computing class probabilities.
I am relatively new to TensorFlow. I imagine that the above sources would be sufficient for a more experienced TensorFlow developer to merge and make work for my use case. Let me know if you can provide any pointers. Thanks.
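Not a full answer, but a TF1-style sketch of option (1) above, combining the linked sources; num_classes and the sparse feeding of the SentencePiece IDs are assumptions based on the USE-lite semantic-similarity colab:
import tensorflow as tf
import tensorflow_hub as hub
import sentencepiece as spm

num_classes = 3   # placeholder

module = hub.Module(
    "https://tfhub.dev/google/universal-sentence-encoder-lite/2", trainable=False)

# The lite module expects SentencePiece IDs as a sparse tensor; convert the output
# of sp.EncodeAsIds(...) into (values, indices, dense_shape) as in the linked colab
# and feed it through this placeholder.
sparse_ids = tf.sparse_placeholder(tf.int64, shape=[None, None])
encodings = module(inputs=dict(values=sparse_ids.values,
                               indices=sparse_ids.indices,
                               dense_shape=sparse_ids.dense_shape))

# Classifier head on top of the sentence encodings.
logits = tf.layers.dense(encodings, num_classes)
probs = tf.nn.softmax(logits)

# Tokenization happens outside the graph with the module's own SentencePiece model.
with tf.Session() as sess:
    spm_path = sess.run(module(signature="spm_path"))
sp = spm.SentencePieceProcessor()
sp.Load(spm_path.decode("utf-8") if isinstance(spm_path, bytes) else spm_path)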

Use pre-trained word2vec in lstm language model?

I used TensorFlow to train an LSTM language model; the code is from here.
According to the article here, it seems that using pre-trained word2vec works better.
Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training.
So, I want to use word2vec to redo the training, but I am a little bit confused about how to do this.
The embedding code goes here:
with tf.device("/cpu:0"):
    embedding = tf.get_variable(
        "embedding", [vocab_size, size], dtype=data_type())
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
How can I change this code to use pre-trained word2vec?
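One common approach is to build a numpy matrix from the word2vec model, aligned with the tutorial's vocabulary, and use it to initialize the variable. A sketch under that assumption (`pretrained` is the assumed matrix; everything else is from the snippet above):
# Sketch, not from the original tutorial: assume `pretrained` is a
# [vocab_size, size] float32 numpy matrix built from a gensim word2vec model
# and aligned with the word-to-id mapping used by the reader code.
with tf.device("/cpu:0"):
    embedding = tf.get_variable(
        "embedding", [vocab_size, size], dtype=data_type(),
        initializer=tf.constant_initializer(pretrained),
        trainable=True)  # set trainable=False to keep the word2vec vectors frozen
    inputs = tf.nn.embedding_lookup(embedding, input_.input_data)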