How do I add an embedding layer in Keras starting from a pd dataframe? - tensorflow

I am trying to build a neural network using both categorical and numerical inputs using Keras to predict student grades ranging from 0-20.
My dataset is already split into train and test sets (two separate dataframes). I split the training set into numerical and categorical attributes. There are 17 categorical attributes and 16 numerical ones. Each categorical column only contains 3-4 categories so I have used OneHotEncoding to transform them. However, it creates unnecessary columns and I would like to experiment with embedding since it's more efficient.
I don't understand what I need to do in order to feed the categorical inputs into the neural model.
This is what my basic neural network looks like.
input = keras.layers.Input(shape=(58,))  # additional columns created through OHE
hidden1 = keras.layers.Dense(300, activation="relu")(input)
hidden2 = keras.layers.Dense(300, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input,hidden2])
output = keras.layers.Dense(21, activation="softmax")(concat)
model = keras.Model(inputs=[input], outputs=[output])
How can I expand it to include an embedding layer? Can I embed all the categorical columns together or would I need to add a layer for each?
I am using sparse categorical crossentropy as my loss function, but I guess I could use a different one now that the categorical inputs have been vectorized?
model.compile(loss="sparse_categorical_crossentropy", optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
I am very new to ML and NNs so apologies if my question is unclear.
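One common approach is to give each categorical column its own integer-ID input and its own Embedding layer, then concatenate the flattened embeddings with the numerical input. Note also that the loss is chosen by how the targets are encoded, not the inputs, so sparse categorical crossentropy remains a reasonable choice for integer grades 0-20. Below is a minimal sketch under assumptions: the toy dataframe, its column names, and the embedding size of 2 are all hypothetical stand-ins, not values from the question.

```python
import numpy as np
import pandas as pd
from tensorflow import keras

# Hypothetical toy frame standing in for the real training dataframe.
df = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "guardian": ["mother", "father", "other", "mother"],
    "age": [15.0, 16.0, 17.0, 15.0],
    "absences": [0.0, 4.0, 2.0, 10.0],
})
y = np.array([12, 8, 15, 10])  # integer grades in 0..20
cat_cols = ["school", "guardian"]
num_cols = ["age", "absences"]

inputs, branches, features = [], [], []
for col in cat_cols:
    # Map each category to an integer ID. In practice the same mapping
    # must be reused for the test dataframe.
    codes = df[col].astype("category").cat.codes.to_numpy().astype("int32")
    n_categories = int(codes.max()) + 1
    inp = keras.layers.Input(shape=(1,), dtype="int32", name=col)
    emb = keras.layers.Embedding(n_categories, 2)(inp)  # shape (batch, 1, 2)
    inputs.append(inp)
    branches.append(keras.layers.Flatten()(emb))        # shape (batch, 2)
    features.append(codes.reshape(-1, 1))

num_inp = keras.layers.Input(shape=(len(num_cols),), name="numeric")
inputs.append(num_inp)
features.append(df[num_cols].to_numpy().astype("float32"))

x = keras.layers.Concatenate()(branches + [num_inp])
x = keras.layers.Dense(300, activation="relu")(x)
output = keras.layers.Dense(21, activation="softmax")(x)

model = keras.Model(inputs=inputs, outputs=output)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.Adam(learning_rate=1e-3))
model.fit(features, y, epochs=1, verbose=0)
preds = model.predict(features, verbose=0)
```

With 17 categorical columns the loop simply creates 17 small embedding branches; the embedding size per column is a tuning choice.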

Related

How to prioritise certain output in MultiOutput LSTM Tensorflow?

Basically, I am creating an LSTM model with Tensorflow and the shape of my input data is something like
(10000 users, 6 timesteps, 20 feature columns) => (10000,6,20)
The model is doing a binary classification using LSTM with 20 output columns giving the shape of (10000, 20).
PS. I'm not doing classification with 20 classes, I'm doing a classification that gives 20 binary outputs for each person
Is it possible to prioritise certain output columns like giving weights or importance to certain columns more than others so that when we train the model it punishes incorrect predictions for these more important output columns more than others or would it make more sense to create separate models for these important columns?
It's easy to use class weights with TensorFlow for this purpose. See the class_weight parameter for model.fit(): https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit
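Note that class_weight is keyed by class index, so it may not map directly onto 20 independent binary outputs. Another option is a custom loss that multiplies the per-column binary cross-entropy by a weight vector. The following is only a sketch: make_weighted_bce and the weight values are hypothetical names and numbers chosen for illustration, and it assumes the model outputs sigmoid probabilities.

```python
import numpy as np
import tensorflow as tf

def make_weighted_bce(column_weights):
    """Binary cross-entropy where each output column has its own weight."""
    w = tf.constant(column_weights, dtype=tf.float32)

    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, tf.float32)
        # Clip probabilities to avoid log(0).
        y_pred = tf.clip_by_value(tf.cast(y_pred, tf.float32), 1e-7, 1.0 - 1e-7)
        per_column = -(y_true * tf.math.log(y_pred)
                       + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        # Weighted mean over the output columns, one value per sample.
        return tf.reduce_mean(per_column * w, axis=-1)

    return loss

# Hypothetical weights: the first 3 of 20 columns count 5x as much.
column_weights = np.ones(20, dtype="float32")
column_weights[:3] = 5.0

y_true = np.zeros((2, 20), dtype="float32")
y_pred = np.full((2, 20), 0.5, dtype="float32")
loss_values = make_weighted_bce(column_weights)(y_true, y_pred).numpy()
```

The loss could then be passed to model.compile(loss=make_weighted_bce(column_weights), ...), which keeps a single model rather than training separate models for the important columns.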

Which Loss function & Metrics is more suitable for multi-label classification? Binary or Categorical cross-entropy and Why?

As far as I know (please correct me if I'm wrong):
Multi-label classification (mutually inclusive): samples might have more than one correct value (for example movie genre, disease detection, etc.).
Multi-class classification (mutually exclusive): samples will always have exactly one correct value (for example cat or dog, object detection, etc.); this includes binary classification.
Assuming the output is one-hot encoded.
Which loss function and metrics does one have to use for these two types?
                 loss func.              metrics
1. multi-label:  (binary, categorical)   (binary_accuracy, TopKCategorical accuracy, categorical_accuracy, AUC)
2. multi-class:  (binary)                (binary_accuracy, f1, recall, precision)
Please tell me from the above table which of them is/are more suitable, which of them is/are wrong & Why?
If you are doing multi-class classification and the labels (y) are one-hot encoded, use categorical crossentropy as the loss function and the Adam optimizer (it is suitable for most cases). Also, for multi-class classification, the number of output nodes should be the same as the number of classes (labels). Say your model is going to classify the input into 4 classes; you can configure the output layer as follows:
model.add(keras.layers.Dense(4, activation="softmax"))
Also, I forgot to mention that softmax activation should be used in the output layer for multi-class classification problems.
In case your y is not one-hot encoded (i.e. it holds integer class indices), I would advise you to choose sparse categorical crossentropy as the loss function. No other changes will be necessary.
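To make the relationship between these losses concrete, here is a small sketch with made-up probabilities. It shows that categorical crossentropy on one-hot labels and sparse categorical crossentropy on integer labels give the same value, and that a multi-label target (several 1s per row) is instead paired with sigmoid outputs and binary crossentropy. All numbers are hypothetical.

```python
import numpy as np
import tensorflow as tf

# Multi-class: softmax-style probabilities over 3 classes (rows sum to 1).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]], dtype="float32")
int_labels = np.array([0, 2])                          # integer class indices
one_hot = np.eye(3, dtype="float32")[int_labels]       # same labels, one-hot

cce = float(tf.keras.losses.CategoricalCrossentropy()(one_hot, probs))
scce = float(tf.keras.losses.SparseCategoricalCrossentropy()(int_labels, probs))

# Multi-label: independent sigmoid-style probabilities, several 1s allowed.
multi_hot = np.array([[1.0, 0.0, 1.0],
                      [0.0, 1.0, 1.0]], dtype="float32")
sigmoid_probs = np.array([[0.9, 0.2, 0.8],
                          [0.1, 0.7, 0.6]], dtype="float32")
bce = float(tf.keras.losses.BinaryCrossentropy()(multi_hot, sigmoid_probs))

print(cce, scce, bce)
```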
Also, I usually split the data into test data and train data and feed them to the model like this to get the accuracy in each epoch:
history = model.fit(train_data, validation_data = test_data, epochs = 10)
Hope it solved your problem.

Extract the output of the embedding layer

I am trying to build a regression model, for which I have a nominal variable with very high cardinality. I am trying to get the categorical embedding of the column.
Input:
df["nominal_column"]
Output:
the embeddings of the column.
I want to use the output of the embedding column alone, since I need it as an input to my traditional regression model. Is there a way to extract that output alone?
P.S I am not asking for code, any suggestion on the approach would be great.
If the embedding is part of the model and you train it, then you can use the functional API of Keras to get the output of any intermediate operation in your graph:
x=Input((number_of_categories,))
y=Embedding(parameters_of_your_embeddings)(x)
output=Rest_of_your_model()(y)
model=Model(inputs=[x],outputs=[output,y])
If you do it before you train the model, you'll have to define a custom loss function that deals only with part of the output. The other way is to train the model with just one output, then create an identical model with two outputs and set the weights of the second model from the trained one.
If you want to get the embedding matrix from your model, you can just use the get_weights method of the embedding layer, which returns the weights as a list of NumPy arrays.
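The get_weights route can be sketched as follows. This is only an illustration under assumptions: the layer name "cat_embedding", the sizes, and the random data are hypothetical, and the nominal column is assumed to be already converted to integer IDs in the range 0..n_categories-1.

```python
import numpy as np
from tensorflow import keras

n_categories, embed_dim = 50, 4

# Tiny regression model with a named embedding layer.
inp = keras.layers.Input(shape=(1,), dtype="int32")
emb = keras.layers.Embedding(n_categories, embed_dim, name="cat_embedding")(inp)
out = keras.layers.Dense(1)(keras.layers.Flatten()(emb))
model = keras.Model(inp, out)
model.compile(loss="mse", optimizer="adam")

# Random stand-in data: integer category IDs and a numeric target.
codes = np.random.randint(0, n_categories, size=(200, 1)).astype("int32")
y = np.random.rand(200, 1).astype("float32")
model.fit(codes, y, epochs=1, verbose=0)

# The trained embedding matrix has one row per category.
embedding_matrix = model.get_layer("cat_embedding").get_weights()[0]

# Look up the embedding of every row; these columns can be fed to
# a traditional regression model as ordinary numeric features.
embedded_features = embedding_matrix[codes[:, 0]]
```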

binarize input for pytorch

May I ask how to make data loaded in PyTorch become binarized once it is loaded?
TensorFlow can do this through:
train_data = mnist.input_data.read_data_sets(data_directory, one_hot=True)
How can PyTorch achieve the one_hot=True effect?
The data_loader I have now is:
torch.set_default_tensor_type('torch.FloatTensor')
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data/', train=True, download=True,
                   transform=transforms.Compose([
                       # transforms.RandomHorizontalFlip(),
                       transforms.ToTensor()])),
    batch_size=batch_size, shuffle=False)
I want to make data in train_loader be binarized.
Now what I am doing is: After loading the data,
for data, _ in train_loader:
    torch.round(data)
    data = Variable(data)
Use the torch.round() function. Is this correct?
The one-hot encoding idea is used for classification. It sounds like you are trying to create an autoencoder perhaps.
If you are creating an autoencoder then there is no need to round, as BCELoss can handle target values between 0 and 1. Note that when training it is better not to apply the sigmoid yourself and instead to use BCEWithLogitsLoss, as it is more numerically stable.
Here is an example of an autoencoder with MNIST
If instead you are attempting to do classification then there is no need for a one-hot vector. You simply output a number of neurons equal to the number of classes (i.e. for MNIST, 10 output neurons) and then pass the output to CrossEntropyLoss along with a LongTensor holding the corresponding expected class values.
Here is an example of classification on MNIST
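Both cases can be sketched with random tensors in place of MNIST, so nothing is downloaded. The batch size and tensor shapes below are assumptions for illustration, and the thresholding line shows one possible way to binarize pixels if that is still wanted.

```python
import torch
import torch.nn as nn

batch_size = 8

# Classification: raw scores for 10 classes plus integer class targets.
# CrossEntropyLoss takes class indices (a LongTensor), not one-hot vectors.
logits = torch.randn(batch_size, 10)
targets = torch.randint(0, 10, (batch_size,))
class_loss = nn.CrossEntropyLoss()(logits, targets)

# Autoencoder: pixel targets in [0, 1] work directly with BCEWithLogitsLoss,
# which applies the sigmoid internally to the raw reconstruction scores.
pixels = torch.rand(batch_size, 784)
recon_logits = torch.randn(batch_size, 784)
recon_loss = nn.BCEWithLogitsLoss()(recon_logits, pixels)

# Optional: binarize pixels by thresholding. The result must be assigned,
# since the comparison does not modify `pixels` in place.
binarized = (pixels > 0.5).float()
```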

Weights update in Tensorflow embedding layer with pretrained fasttext weights

I'm not sure if my understanding is correct but...
While training a seq2seq model, one reason I want to initialize the embedding layer with pre-trained fastText weights is to reduce the number of unknown words at test time (words that are not in the training set). Since the pre-trained fastText model has a larger vocabulary, an unknown word at test time can be represented by its fastText out-of-vocabulary word vector, which is supposed to point in a similar direction to semantically similar words in the training set.
However, the initial fastText weights in the embedding layer will be updated during training (updating the weights gives better results). I am wondering whether the updated embedding weights would distort the semantic-similarity relationships between words and undermine the representation of fastText out-of-vocabulary word vectors (and also the relationship between the updated embedding weights and the word vectors in the initial embedding layer whose IDs didn't appear in the training data).
If each input ID were first mapped to a distributed vector extracted from the pre-trained model (with those weights fixed while training), and these pre-trained vectors were then mapped via a lookup table to the embedding layer (whose weights are updated while training), would that be a better solution?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vectors and fine-tuning them in your final model, the words that are infrequent or haven't appeared in your training set won't get any updates.
Now, usually one can test how much of an issue this is for your particular case. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what the difference in model performance on the validation set is.
If you see a big difference in performance on the validation set when you are not fine-tuning, here are a few ways to handle this:
a) Add a linear transformation layer after not-trainable embeddings. Fine-tuning embeddings in many cases does affine transformations to the space, so one can capture this in a separate layer that can be applied at test time.
E.g. A is pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_variable("b", [embed_size])
embeds = tf.matmul(embeds, X) + b  # affine map; assumes embeds has shape [batch, embed_size]
b) Keep the pre-trained embeddings in the non-trainable embedding matrix A. Add a trainable embedding matrix B that has a smaller vocab (the popular words in your training set) and embedding size. Look up words in both A and B (if a word is out of B's vocab, use ID=0 for example), concat the results and use that as input to your model. This way you will teach your model to use mostly A and sometimes rely on B for popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
B = tf.get_variable("B", [smaller_vocab_size, embed_size])
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens, tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)
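The snippets above use the TensorFlow 1.x variable API. For reference, both options can be sketched in eager TensorFlow 2 as follows. This is an illustration under assumptions: the sizes are arbitrary, A is filled with random values as a stand-in for real fastText vectors, and tokens is assumed to be a 1-D batch of word IDs.

```python
import numpy as np
import tensorflow as tf

vocab_size, smaller_vocab_size, embed_size = 1000, 100, 8

# Frozen pre-trained matrix (random stand-in). A constant is never updated.
A = tf.constant(np.random.rand(vocab_size, embed_size), dtype=tf.float32)
tokens = tf.constant([3, 250, 999])

# (a) Trainable affine layer on top of the frozen lookup.
# Initialized to the identity so training starts from the original vectors.
X = tf.Variable(tf.eye(embed_size), name="X")
b = tf.Variable(tf.zeros([embed_size]), name="b")
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
embeds_a = tf.matmul(fixed_embeds, X) + b

# (b) Frozen A plus a small trainable matrix B for frequent words.
B = tf.Variable(tf.random.normal([smaller_vocab_size, embed_size]), name="B")
# IDs outside B's vocabulary fall back to row 0.
small_ids = tf.where(tokens < smaller_vocab_size, tokens, tf.zeros_like(tokens))
dyn_embeds = tf.nn.embedding_lookup(B, small_ids)
embeds_b = tf.concat([fixed_embeds, dyn_embeds], axis=1)
```

In option (a) only X and b receive gradients, so vectors for words unseen during training pass through the same learned transformation at test time.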