Keras autoencoder and getting the compressed feature vector representation - tensorflow

I have one sentence per row in a file, and no sentence is longer than 30 words. I am building an autoencoder with Keras and I am very new to this, so I may be doing a few things incorrectly.
I am trying to use the autoencoder to get the intermediate context vector, i.e. the compressed feature representation produced by the encoding step.
vocabulary is simply the list of distinct words in my file, 300 is the dimension of the word embedding, and 30 is the maximum number of words a sentence can have. X_train is a (#sentences, 30) matrix of integers, where each integer is the position of the word in the vocabulary.
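For reference, a minimal sketch of how such an X_train could be built (the file name, the whitespace tokenization and the padding choice below are assumptions, not my actual preprocessing):

from keras.preprocessing.sequence import pad_sequences

# hypothetical preprocessing: map each word to its index in the vocabulary
sentences = [line.split() for line in open('sentences.txt')]   # assumed file name
vocabulary = sorted({word for sent in sentences for word in sent})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# each row holds the word indices of one sentence, padded/truncated to 30 entries
# note: in this simple sketch index 0 doubles as the padding value
X_train = pad_sequences([[word_to_index[w] for w in sent] for sent in sentences],
                        maxlen=30, padding='post', truncating='post')
print(X_train.shape)  # (#sentences, 30)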
from keras.models import Sequential, Model
from keras.layers import Input, Dense, Embedding

print(len(vocabulary))

# embedding lookup: maps each word index to a 300-dimensional vector
model = Sequential()
model.add(Embedding(len(vocabulary), 300))
model.compile('rmsprop', 'mse')
input_i = Input(shape=(30, 300))
encoded_h1 = Dense(64, activation='tanh')(input_i)
encoded_h2 = Dense(32, activation='tanh')(encoded_h1)
encoded_h3 = Dense(16, activation='tanh')(encoded_h2)
encoded_h4 = Dense(8, activation='tanh')(encoded_h3)
encoded_h5 = Dense(4, activation='tanh')(encoded_h4)
latent = Dense(2, activation='tanh')(encoded_h5)
decoder_h1 = Dense(4, activation='tanh')(latent)
decoder_h2 = Dense(8, activation='tanh')(decoder_h1)
decoder_h3 = Dense(16, activation='tanh')(decoder_h2)
decoder_h4 = Dense(32, activation='tanh')(decoder_h3)
decoder_h5 = Dense(64, activation='tanh')(decoder_h4)
output = Dense(300, activation='tanh')(decoder_h5)
autoencoder = Model(input_i,output)
autoencoder.compile('adadelta','mse')
X_embedded = model.predict(X_train)
autoencoder.fit(X_embedded,X_embedded,epochs=10, batch_size=256, validation_split=.1)
autoencoder.summary()
The idea is taken from Keras - Autoencoder for Text Analysis
So, after training (if I have done this correctly), how do I run just the encoding part for each sentence to get its compressed feature representation? Help is appreciated. Thanks!

Make a standalone model for the encoder:
encoder = Model(input_i, latent)
For example, with MNIST data the call would be:
encoder.predict(x_train[0])
This gives you the latent-space vector as the output.

To do this, refer to "popping" off the last layer of a Sequential model via model.pop(). After training, "pop off" the layers after the bottleneck one at a time with model.pop(), then use model.predict(X_train) to get the intermediate representation.
https://keras.io/getting-started/faq/#how-can-i-remove-a-layer-from-a-sequential-model
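A minimal sketch of that pop-based approach, assuming the autoencoder is rebuilt as a Sequential model (the layer sizes are illustrative, not the architecture above):

from keras.models import Sequential
from keras.layers import Dense

# illustrative Sequential autoencoder over the embedded sentences: 300 -> 64 -> 2 -> 64 -> 300
seq_autoencoder = Sequential([
    Dense(64, activation='tanh', input_shape=(30, 300)),
    Dense(2, activation='tanh'),   # bottleneck
    Dense(64, activation='tanh'),
    Dense(300, activation='tanh'),
])
seq_autoencoder.compile('adadelta', 'mse')
seq_autoencoder.fit(X_embedded, X_embedded, epochs=10, batch_size=256)

# each pop() removes the current last layer; drop both decoder layers
seq_autoencoder.pop()   # removes Dense(300)
seq_autoencoder.pop()   # removes Dense(64)

# the model now ends at the bottleneck, so predict() returns the compressed codes
codes = seq_autoencoder.predict(X_embedded)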

Related

Deep Learning model (LSTM) predicts same class label

I am trying to solve a spoken digit recognition task using an LSTM model, where the audio files are converted into spectrograms and fed into an LSTM, followed by global average pooling. Here is the architecture:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, GlobalAveragePooling1D, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

tf.keras.backend.clear_session()

# input layer
input_ = Input(shape=(64, 35))
lstm = LSTM(100, activation='tanh', return_sequences=True, kernel_regularizer=l2(0.000001),
            recurrent_initializer='glorot_uniform')(input_)
lstm = GlobalAveragePooling1D(data_format='channels_first')(lstm)
dense = Dense(20, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(lstm)
drop = Dropout(0.8)(dense)
dense1 = Dense(25, activation='relu', kernel_regularizer=l2(0.000001), kernel_initializer='he_uniform')(drop)
drop = Dropout(0.95)(dense1)
output = Dense(10, activation='softmax', kernel_regularizer=l2(0.000001), kernel_initializer='glorot_uniform')(drop)

model_2 = Model(inputs=[input_], outputs=output)
model_2.summary()
I need to calculate the F1 score to check the performance of the model. I have implemented a custom callback and have also used the TensorFlow Addons F1 score, but I don't get a sensible result: I get the same constant F1 score value for every epoch.
On further digging, I found that my model predicts the same class label for the entire epoch, whereas it is supposed to distinguish all 10 classes, since there are 10 class label values present.
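For context, here is a minimal sketch of the kind of F1 callback I mean (the validation arrays and the macro averaging are assumptions, not my exact implementation):

import numpy as np
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback

class F1Callback(Callback):
    """Compute macro F1 on held-out data at the end of every epoch."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = y_val  # one-hot encoded labels

    def on_epoch_end(self, epoch, logs=None):
        probs = self.model.predict(self.x_val, verbose=0)
        y_pred = np.argmax(probs, axis=1)
        y_true = np.argmax(self.y_val, axis=1)
        print(f" epoch {epoch + 1} - val macro F1: {f1_score(y_true, y_pred, average='macro'):.4f}")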
Here are my model.compile and model.fit commands; I have used the TensorFlow Addons metric here:
from tensorflow import keras

opt = keras.optimizers.Adam(0.001, clipnorm=0.8)
# metric: the TensorFlow Addons F1 score mentioned above
model_2.compile(loss='categorical_crossentropy', optimizer=opt, metrics=metric)

hist = model_2.fit([X_train_spectrogram],
                   [y_train_converted],
                   validation_data=([X_test_spectrogram], [y_test_converted]),
                   epochs=10,
                   verbose=1,
                   callbacks=[tensorBoard_callbk2, ClearMemory()],
                   # steps_per_epoch = 3,
                   batch_size=32)
Here is what I mean by getting the same prediction: the entire prediction array is filled with the same predicted value for every sample.
Why is the model predicting the same class label, and how can I rectify it?
I have tried increasing the number of trainable parameters and increasing/decreasing the batch size, but it hasn't helped. If anyone knows, can you please help me out?

Why does my model learn with Ragged Tensors but not Dense Tensors?

I have strings of letters that follow a "grammar." I also have boolean labels on my training set indicating whether each string follows "the grammar" or not. Basically, my model is trying to learn to determine whether a string of letters follows the rules. It's a fairly simple problem (I got it out of a textbook).
I am generating my dataset like this:
import numpy as np
import tensorflow as tf

def generate_dataset(size):
    # generate_string, generate_corrupted_string and string_to_ids come from the exercise code
    good_strings = [string_to_ids(generate_string(embedded_reber_grammar))
                    for _ in range(size // 2)]
    bad_strings = [string_to_ids(generate_corrupted_string(embedded_reber_grammar))
                   for _ in range(size - size // 2)]
    all_strings = good_strings + bad_strings
    X = tf.ragged.constant(all_strings, ragged_rank=1)
    # X = X.to_tensor(default_value=0)
    y = np.array([[1.] for _ in range(len(good_strings))] +
                 [[0.] for _ in range(len(bad_strings))])
    return X, y
Notice the line X = X.to_tensor(default_value=0). If this line is commented out, my model learns just fine. However, if it is not commented out, the model fails to learn and performs no better than chance (50-50) on the validation set.
Here is my actual model:
import numpy as np
import tensorflow as tf
from tensorflow import keras

np.random.seed(42)
tf.random.set_seed(42)

embedding_size = 5

model = keras.models.Sequential([
    keras.layers.InputLayer(input_shape=[None], dtype=tf.int32, ragged=True),
    keras.layers.Embedding(input_dim=len(POSSIBLE_CHARS) + 1, output_dim=embedding_size),
    keras.layers.GRU(30),
    keras.layers.Dense(1, activation="sigmoid")
])
optimizer = keras.optimizers.SGD(lr=0.02, momentum=0.95, nesterov=True)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])
history = model.fit(X_train, y_train, epochs=5, validation_data=(X_valid, y_valid))
I am using 0 as the default (padding) value for the dense tensors. string_to_ids doesn't use 0 for any of the values; it starts at 1. Also, when I switch to using a dense tensor, I change ragged=True to ragged=False. I have no idea why using a dense tensor causes the model to fail, as I've used dense tensors before in similar exercises.
For additional details, see the solution from the book (exercise 8) or my own colab notebook.
It turns out the answer was that the shape of the dense tensor differed between the training set and the validation set. This was because the longest sequence differed in length between the two sets (and the test set as well).
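A minimal sketch of one way to fix this, assuming you pad every split to one shared fixed length rather than to each split's own maximum:

import tensorflow as tf

MAX_LEN = 50  # assumed upper bound on sequence length across all splits

def to_fixed_dense(X_ragged, max_len=MAX_LEN):
    # pad a ragged batch of id sequences to a common fixed length
    return X_ragged.to_tensor(default_value=0, shape=[None, max_len])

X_train_dense = to_fixed_dense(X_train)
X_valid_dense = to_fixed_dense(X_valid)  # now has the same second dimension as X_train_dense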

How to train any Hugging face transformer model (eg DistilBERT) for question answer from scratch using Tensorflow backend?

I want to understand how to train a Hugging Face transformer model (like BERT, DistilBERT, etc.) for a question-answering system with TensorFlow as the backend. The following is the logic that I am currently using (but I am not sure whether it is the right approach):
I am using SQuAD v1.1 dataset.
In the SQuAD dataset, the answer to any question is always present in the context. So, to put it in simple words, I am trying to predict the start index and the end index of the answer.
I have transformed the dataset for this purpose: I have added the start index and end index of the answer at the word level after performing tokenization.
Next, I encode the question and the context as per the Hugging Face docs and return input_ids, attention_mask and token_type_ids, which will be used as input to the model.
import numpy as np
from tqdm import tqdm_notebook

def tokenize(questions, contexts):
    # tokenizer: a pre-loaded Hugging Face tokenizer
    input_ids, input_masks, input_segments = [], [], []
    for question, context in tqdm_notebook(zip(questions, contexts)):
        inputs = tokenizer.encode_plus(question, context, add_special_tokens=True, max_length=512,
                                       pad_to_max_length=True, return_attention_mask=True,
                                       return_token_type_ids=True)
        input_ids.append(inputs['input_ids'])
        input_masks.append(inputs['attention_mask'])
        input_segments.append(inputs['token_type_ids'])
    return [np.asarray(input_ids, dtype='int32'), np.asarray(input_masks, dtype='int32'), np.asarray(input_segments, dtype='int32')]
Finally, I define a Keras model which takes these three inputs and predicts two values: the start and end word index of the answer in the given context.
input_ids_in = tf.keras.layers.Input(shape=(512,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(512,), name='masked_token', dtype='int32')
input_segment_in = tf.keras.layers.Input(shape=(512,), name='segment_token', dtype='int32')

# transformer_model: a pre-loaded Hugging Face TF model
embedding_layer = transformer_model({'inputs': input_ids_in,
                                     'attention_mask': input_masks_in,
                                     'token_type_ids': input_segment_in})[0]

X = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(50, return_sequences=True,
                                                       dropout=0.1, recurrent_dropout=0.1))(embedding_layer)
X = tf.keras.layers.GlobalMaxPool1D()(X)

start_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
start_branch = tf.keras.layers.Dropout(0.3)(start_branch)
start_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='start_branch')(start_branch)

end_branch = tf.keras.layers.Dense(1024, activation='relu')(X)
end_branch = tf.keras.layers.Dropout(0.3)(end_branch)
end_branch_output = tf.keras.layers.Dense(512, activation='softmax', name='end_branch')(end_branch)

model = tf.keras.Model(inputs=[input_ids_in, input_masks_in, input_segment_in],
                       outputs=[start_branch_output, end_branch_output])
I am using a final softmax layer with 512 units because 512 is my maximum number of tokens, and my aim is to predict the answer's index from it.
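To make the training step concrete, here is a hedged sketch of how this two-output model might be compiled and fit; the label arrays y_start / y_end (integer token indices), the question/context lists and the hyperparameters are assumptions, not part of my current code:

# y_start, y_end: integer arrays of shape (num_samples,) holding the answer's start/end token index
model.compile(optimizer=tf.keras.optimizers.Adam(3e-5),
              loss={'start_branch': 'sparse_categorical_crossentropy',
                    'end_branch': 'sparse_categorical_crossentropy'},
              metrics=['accuracy'])

train_inputs = tokenize(train_questions, train_contexts)  # [input_ids, attention_mask, token_type_ids]
model.fit(train_inputs,
          {'start_branch': y_start, 'end_branch': y_end},
          validation_split=0.1,
          epochs=2,
          batch_size=8)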

How to get output from a specific layer in keras.tf, the bottleneck layer in autoencoder?

I am developing an autoencoder for clustering certain groups of images.
input_images->...->bottleneck->...->output_images
I have calibrated the autoencoder to my satisfaction and saved the model; everything has been developed using keras.tensorflow on python3.
The next step is to apply the autoencoder to a ton of images and cluster them according to cosine distance in the bottleneck layer. Oops, I just realized that I don't know the syntax in keras.tf for running the model on a batch up to a specific layer rather than to the output layer. Thus the question:
How do I run something like Model.predict_on_batch or Model.predict_generator up to the certain "bottleneck" layer and retrieve the values on that layer rather than the values on the output layer?
You need to define a new model (if you didn't define the encoder and decoder as separate models initially, which is usually the easiest option).
If your model was defined without reusing layers, it's just:
inputs = model.input
outputs= model.get_layer('bottleneck').output
encoder = Model(inputs, outputs)
Use the encoder model as any other model.
The full code would look like this:
from keras.layers import Input, Dense
from keras import models

# ENCODER
encoding_dim = 37310
input_layer = Input(shape=(encoding_dim,))
encoder = Dense(500, activation='tanh')(input_layer)
encoder = Dense(100, activation='tanh')(encoder)
encoder = Dense(50, activation='tanh', name='bottleneck_layer')(encoder)
decoder = Dense(100, activation='tanh')(encoder)
decoder = Dense(500, activation='tanh')(decoder)
decoder = Dense(37310, activation='sigmoid')(decoder)
# full model
model_full = models.Model(input_layer, decoder)
model_full.compile(optimizer='adam', loss='mse')
model_full.fit(x, y, epochs=20, batch_size=16)  # for an autoencoder, y is typically the input x itself
# bottleneck model
bottleneck_output = model_full.get_layer('bottleneck_layer').output
model_bottleneck = models.Model(inputs = model_full.input, outputs = bottleneck_output)
bottleneck_predictions = model_bottleneck.predict(X_test)
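Since the goal here is clustering by cosine distance in the bottleneck layer, a short follow-up sketch (using scikit-learn, which the answer above does not cover) might look like this:

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

# pairwise cosine distances between the bottleneck codes of all images
distances = cosine_distances(bottleneck_predictions)

# e.g. the five images closest to image 0 in bottleneck space (excluding itself)
nearest = np.argsort(distances[0])[1:6]
print(nearest)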

Reduce dimension: best architecture

I have a dataset:
100 timesteps
10 variables
for example,
dataset = np.arange(1000).reshape(100,10)
The 10 variables are related to each other, so I want to reduce the dimensionality from 10 to 1.
The 100 time steps are also related.
Which deep learning architecture is suitable for this?
edit:
from keras.models import Sequential
from keras.layers import LSTM, Dense
X = np.arange(1000).reshape(100,10)
model = Sequential()
model.add(LSTM(32, input_shape=(100, 10), return_sequences=False))  # LSTM needs a units argument; 32 here is an arbitrary choice
model.add(Dense(1))
model.compile(loss='mse', optimizer='adam')
model.fit(???, epochs=50, batch_size=5)
In order to compress your data, the best course of action is to use an autoencoder.
Autoencoder architecture:
Input ---> Encoder (reduces dimensionality of the input) ---> Decoder (tries to recreate the input) ---> Lossy version of the input
By extracting the trained encoder, we can find a way to represent your data using fewer dimensions.
from keras.layers import Input, Dense
from keras.models import Model

input = Input(shape=(10,))  # takes input of shape (num_samples, 10)
encoded = Dense(1, activation='relu')(input)  # compresses the input to a 1D vector
decoded = Dense(10, activation='sigmoid')(encoded)  # tries to recreate the input from the 1D vector

autoencoder = Model(input, decoded)  # input ---> lossy reconstruction from decoded
Now that we have the autoencoder, we need to extract what you really want: the encoder part, which reduces the input's dimensionality:
encoder = Model(input, encoded) #maps input to reduced-dimension encoded form
Compile and train the autoencoder:
autoencoder.compile(optimizer='adam', loss='mse')
X = np.arange(1000).reshape(100, 10)
autoencoder.fit(X, X, batch_size=5, epochs=50)
Now you can use the encoder to reduce dimensionality:
encoded_form = encoder.predict(<something with shape (samples, 10)>)  # outputs a 1D vector per sample
You probably want the decoder as well. If you are going to use it, put this block of code right before you compile and fit the autoencoder:
encoded_form = Input(shape=(1,))
decoder_layer = autoencoder.layers[-1]
decoder = Model(encoded_form, decoder_layer(encoded_form))
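As a quick usage note (the variable name reduced is just illustrative), once the autoencoder is trained, the encoder and decoder can be chained manually:

reduced = encoder.predict(X)              # shape (100, 1): one compressed value per row of X
reconstructed = decoder.predict(reduced)  # shape (100, 10): lossy reconstruction of X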