Can I feed categorical data into a Keras embedding layer without encoding the data? - tensorflow

I am trying to feed multi-column categorical data into a Keras embedding layer. Can I feed categorical data into the Keras embedding layer without encoding it?
If not, which encoding method is preferable for retrieving contextual information from the categorical data?

No, you cannot feed categorical data into a Keras embedding layer without encoding it first.
There are a couple of ways to encode the data:
Integer Encoding: Where each unique label is mapped to an integer.
One Hot Encoding: Where each label is mapped to a binary vector.
Learned Embedding: Where a distributed representation of the categories is learned.
The preferred method for retrieving contextual information from categorical data is the learned embedding.
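As a minimal sketch (not from the original answer, and assuming a single categorical column), integer encoding followed by an Embedding layer looks roughly like this; for multi-column data you would typically use one embedding per column and concatenate the outputs:
import numpy as np
import tensorflow as tf

# Toy data for a single categorical column.
colors = np.array(["red", "green", "blue", "green"])
# Integer encoding: each unique label is mapped to an integer (index 0 is
# reserved for out-of-vocabulary values by StringLookup's defaults).
lookup = tf.keras.layers.StringLookup(vocabulary=np.unique(colors))
color_ids = lookup(colors)  # e.g. [3, 2, 1, 2]
# Learned embedding: one trainable 4-dimensional vector per category.
embedding = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(),
                                      output_dim=4)
vectors = embedding(color_ids)
print(vectors.shape)  # (4, 4)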
You could also use any of the pretrained embeddings below:
Glove Embeddings (https://nlp.stanford.edu/projects/glove/)
Word2Vec.
ConceptNet (https://github.com/commonsense/conceptnet-numberbatch)
ELMo embeddings (https://github.com/yuanxiaosc/ELMo)
ELMo embedding usage example:
import tensorflow_hub as hub
import tensorflow as tf
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
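As a hedged follow-up sketch (hub.Module is the TF1-style API, so under TF 2.x it needs the compat layer with eager execution disabled; the sentence below is only a placeholder input):
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub

tf.disable_eager_execution()

elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
# The "default" signature takes raw sentences; the "elmo" key holds per-token vectors.
embeddings = elmo(["the cat sat on the mat"],
                  signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)  # (1, num_tokens, 1024)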

Related

What do the embedding elements stand for in the HuggingFace BERT model?

Prior to passing my tokens through the encoder in the BERT model, I would like to perform some processing on their embeddings. I extracted the embedding weights using:
from transformers import TFBertModel
# Load a pre-trained BERT model
model = TFBertModel.from_pretrained('bert-base-uncased')
# Get the embedding layer of the model
embedding_layer = model.get_layer('bert').get_input_embeddings()
# Extract the embedding weights
embedding_weights = embedding_layer.get_weights()
I found that it contains 5 elements.
In my understanding, the first three elements are the word embedding weights, token type embedding weights, and positional embedding weights. My question is: what do the last two elements stand for?
I dug into the source code of the BERT model, but I could not figure out the meaning of the last two elements.
In the BERT model there is a post-processing step on the embedding tensor that applies layer normalization followed by dropout:
https://github.com/google-research/bert/blob/eedf5716ce1268e56f0a50264a88cafad334ac61/modeling.py#L362
I think those two arrays are the gamma and beta of the normalization layer: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization
They are learned parameters and span the axes of the inputs specified by the "axis" argument, which defaults to -1 (corresponding to the dimension of size 768 in the embedding tensor).
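One way to check this (a minimal sketch, not from the original answer) is to print the names and shapes of the embedding layer's weights; the last two should both have shape (768,), matching LayerNorm's gamma and beta:
from transformers import TFBertModel

model = TFBertModel.from_pretrained('bert-base-uncased')
embedding_layer = model.get_layer('bert').get_input_embeddings()
for w in embedding_layer.weights:
    print(w.name, w.shape)
# Roughly expected: word embeddings (30522, 768), token type embeddings (2, 768),
# position embeddings (512, 768), LayerNorm gamma (768,), LayerNorm beta (768,)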

TensorFlow Keras: unexpected input dimension for LSTM

I'm trying to complete the LSTM music-composition example using TensorFlow from
https://www.datacamp.com/tutorial/using-tensorflow-to-compose-music
I've got as far as the LSTM model, but a dimension error is raised for the inputs. I've used the code provided in the tutorial. A new training set is created, but it is not converted the way it is for the autoencoder models in the earlier examples.
This piece of code is not included in the preparation step for the LSTM model:
# Convert to one-hot encoding and swap chord and sequence dimensions
trainChords = tf.keras.utils.to_categorical(trainChords).transpose(0,2,1)
# Convert data to numpy array of type float
trainChords = np.array(trainChords, np.float)
# Flatten sequence of chords into single dimension
trainChordsFlat = trainChords.reshape(nSamples, nChordsSequence)
What do these steps do? Are they also required for the LSTM model?
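For illustration only (a tiny sketch with made-up shapes, not part of the tutorial), the first two lines one-hot encode the integer chord ids and then swap the chord and sequence axes:
import numpy as np
import tensorflow as tf

x = np.array([[0, 2, 1],
              [1, 1, 3]])                  # (nSamples=2, sequence length=3)
onehot = tf.keras.utils.to_categorical(x)  # (2, 3, 4): 4 distinct chord ids
swapped = onehot.transpose(0, 2, 1)        # (2, 4, 3): chord axis before time axis
print(onehot.shape, swapped.shape)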

How to use a pretrained embedding vector with feature_column

I have a pretrained embedding vector like this: {"key": [0.21123813 -0.09532269 0.11912347 -0.28437278 -0.5040968 -0.3963967 0.073469564 0.33775213 -0.118199855 -0.12064915]} and I want to use it in Keras. How do I create a feature column for it with feature_column?
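A hedged sketch (not an answer from the thread): one way is to pass the pretrained vectors through the initializer argument of tf.feature_column.embedding_column; the vocabulary, matrix, and feature name below are made up for illustration:
import numpy as np
import tensorflow as tf

vocab = ["key", "other_key"]  # hypothetical vocabulary
pretrained = np.random.rand(len(vocab), 10).astype("float32")  # stand-in for the real 10-d vectors

def pretrained_init(shape, dtype=None, partition_info=None):
    # Return the pretrained matrix instead of a random initialization.
    return pretrained

cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
    "my_feature", vocabulary_list=vocab)
emb_col = tf.feature_column.embedding_column(
    cat_col, dimension=10, initializer=pretrained_init,
    trainable=False)  # keep the pretrained values fixed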

Save trained gensim word2vec model as a tensorflow SavedModel

Do we have an option to save a trained Gensim Word2Vec model as a SavedModel using TF 2.0's tf.saved_model.save? In other words, how can I save trained embedding vectors as a SavedModel signature to use with TensorFlow 2.0? The following steps, naturally, do not work:
model = gensim.models.Word2Vec(...)
model.init_sims(..)
model.train(..)
model.save(..)
module = gensim.models.KeyedVectors.load_word2vec(...)
tf.saved_model.save(
module,
export_dir
)
EDIT: this example helped me figure out how to do it: https://keras.io/examples/nlp/pretrained_word_embeddings/
Gensim does not use TensorFlow; it has its own methods for loading and saving models.
You would need to convert the Gensim embeddings into a TensorFlow model, which only makes sense if you plan to keep using the embeddings within TensorFlow and possibly fine-tune them for your task.
A Gensim Word2Vec model corresponds to two steps in TensorFlow:
Vocabulary lookup: a table that assigns indices to tokens.
Embedding lookup layer that picks up the actual embeddings for the indices.
Then you can save it like any other TensorFlow model.
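A minimal sketch of those two steps (assuming gensim 4.x and TF 2.x; the toy corpus and the export directory name are placeholders):
import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

w2v = Word2Vec(sentences=[["hello", "world"], ["hello", "tensorflow"]],
               vector_size=8, min_count=1)  # toy corpus for illustration
kv = w2v.wv

# Step 1: vocabulary lookup (token -> index); index 0 is the OOV bucket.
lookup = tf.keras.layers.StringLookup(vocabulary=kv.index_to_key)
# Step 2: embedding lookup initialized from the gensim vectors (row 0 = OOV).
matrix = np.vstack([np.zeros((1, kv.vector_size)), kv.vectors])
embedding = tf.keras.layers.Embedding(
    input_dim=matrix.shape[0], output_dim=kv.vector_size,
    embeddings_initializer=tf.keras.initializers.Constant(matrix),
    trainable=False)

model = tf.keras.Sequential([lookup, embedding])
print(model(tf.constant([["hello", "unseen"]])).shape)  # build the model
tf.saved_model.save(model, "exported_w2v")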

How to package string labels into a SavedModel

I have string labels such as "cat" and "dog". Can I feed string labels directly to deep learning models in TensorFlow and get string labels back as predictions? I am looking for the equivalent of sklearn's LabelEncoder (from sklearn.preprocessing import LabelEncoder).
If this is not possible, is there a way to pack the labels into the SavedModel protobuf file and retrieve them based on indices at serving time? I am using the Estimator's export_savedmodel API. Is assets_extra the right way? The approach at https://github.com/tensorflow/serving/issues/55 does not use the SavedModel format.
The typical way to handle label data in deep learning is to embed the labels in a vector space. Language models do this routinely with word embeddings. TensorFlow provides embedding lookup operations that you can use for your purposes.
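As a hedged sketch (using the Keras StringLookup layer rather than the Estimator assets_extra mechanism mentioned in the question), the label vocabulary can be baked into the model itself, so the SavedModel can return string labels directly:
import tensorflow as tf

labels = ["cat", "dog", "bird"]  # hypothetical label set
to_index = tf.keras.layers.StringLookup(vocabulary=labels, num_oov_indices=0)
to_label = tf.keras.layers.StringLookup(vocabulary=labels, num_oov_indices=0,
                                        invert=True)

# Encode string labels to integer ids for training ...
print(to_index(tf.constant(["dog", "cat"])))  # -> [1 0]
# ... and decode predicted ids back to strings at serving time.
print(to_label(tf.constant([2, 0])))          # -> [b'bird' b'cat']
# Wrapping to_label around the model's argmax output and exporting with
# tf.saved_model.save stores the vocabulary inside the SavedModel's assets.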