Softmax outputs aren't extreme when using TensorFlow: no prediction is very confident

I trained a text classification model on data in TensorFlow and plotted the SoftMax confidence for the correct prediction as well as the SoftMax confidence for incorrect predictions. When I did this I noticed that there were no output predictions with a high logit/class probability. For example, predicting between 4 classes had these results:
(TensorFlow version)
input text: "Text that fits into class 0"
logits: [0.3928, 0.2365, 0.1854, 0.1854]
label: class 0
I would hope that the output for class 0 would be higher than 0.3928! Looking at the graph, you can see that none of the predictions outputs a number higher than 0.5.
Next, I retrained the exact same model on the same dataset, but in PyTorch. With PyTorch, I got the results I was looking for. Both models had the exact same validation accuracy after training (90% val accuracy).
(PyTorch version)
input text: "Text that fits into class 0"
logits: [0.8778, 0.0532, 0.0056, 0.0635]
label: class 0
Here is what I understand about the softmax function:
The softmax function transforms its inputs, the logits, into a probability distribution over the classes. Because the outputs must sum to 1, the largest probability can only approach 1 when one logit is much larger than the others; when the logits are close together, the distribution stays close to uniform and the top probability is relatively low.
To get more extreme outputs, the model needs to produce larger differences between its logits. One way to encourage this is to train with more data and a more powerful architecture, so the model can learn more complex relationships between inputs and outputs. Note that extremely high confidence is not always desirable, as it can reflect overfitting to the training data. The appropriate level of confidence depends on the specific use case and the desired trade-off between precision and recall.
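To make this concrete, here is a minimal sketch (illustrative logits, not from either model): scaling the logits up while keeping their ordering sharpens the softmax distribution.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

# Same relative ordering, different magnitudes:
small = softmax(np.array([1.0, 0.5, 0.0, 0.0]))  # logits close together
large = softmax(np.array([4.0, 2.0, 0.0, 0.0]))  # same ordering, spread out

print(small)  # roughly [0.43, 0.26, 0.16, 0.16] -- top probability is low
print(large)  # roughly [0.85, 0.12, 0.02, 0.02] -- much more confident
```

Both distributions pick class 0, but only the one with widely separated logits is confident about it.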
My TensorFlow model:
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny', from_pt=True)
encoder = TFAutoModel.from_pretrained('prajjwal1/bert-tiny', from_pt=True)

# Define input layers for token ids and attention mask
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Call the BERT encoder with the inputs
pooler_output = encoder(input_ids, attention_mask=attention_mask)[1]  # [1] is the pooler output

# Define a dense head on top of the pooled output
x = tf.keras.layers.Dense(units=params['fc_layer_size'])(pooler_output)
x = tf.keras.layers.Dropout(params['dropout'])(x)
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)

# Define a model with the inputs and the classification head
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0008)
metrics = [tf.metrics.SparseCategoricalAccuracy()]

# Compile the model (pass the optimizer object, not the string 'sgd',
# otherwise the custom learning rate above is silently ignored)
model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
My PyTorch model:
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
encoder = AutoModel.from_pretrained('prajjwal1/bert-tiny')

class EscalationClassifier(nn.Module):
    def __init__(self, encoder):
        super(EscalationClassifier, self).__init__()
        self.encoder = encoder
        self.fc1 = nn.Linear(128, 312)
        self.fc2 = nn.Linear(312, 4)
        self.dropout = nn.Dropout(0.2)

    def forward(self, input_ids, attention_mask):
        pooled_output = self.encoder(input_ids, attention_mask=attention_mask)[1]  # [0] is the last hidden state, [1] the pooler output
        x = self.fc1(pooled_output)
        x = self.dropout(x)
        x = self.fc2(x)  # raw logits; CrossEntropyLoss applies softmax internally
        return x

model = EscalationClassifier(encoder)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0008)
Can anyone help me explain why my TensorFlow outputs aren't as confident as the PyTorch outputs? (The problem doesn't seem to be with the softmax itself.)
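One thing worth checking (an observation, not a confirmed diagnosis): the TF model's final Dense has `activation='softmax'`, so its outputs are already probabilities, while the PyTorch `forward()` returns the raw `fc2` output. The two are only comparable after passing the PyTorch output through a softmax as well. A small sketch with hypothetical logits:

```python
import torch
import torch.nn.functional as F

# Hypothetical raw logits, as returned by a PyTorch forward() with no activation
raw_logits = torch.tensor([2.5, -0.3, -2.5, -0.1])

# Apply softmax so the numbers are on the same scale as the TF model's output
probs = F.softmax(raw_logits, dim=0)
print(probs)  # a peaked distribution, even though no raw logit exceeded 2.5
```

Note how modest-looking raw logits can still map to a very confident probability for the top class once the softmax is applied.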

Related

Categorical_crossentropy loss function has value of 0.0000e+00 for a BiLSTM sentiment analysis model

This is the graph of my model.
Model code:
def model_creation(vocab_size, embedding_dim, embedding_matrix,
                   rnn_units, batch_size,
                   train_embed=False):
    model = Sequential([
        Embedding(vocab_size, embedding_dim,
                  weights=[embedding_matrix], trainable=train_embed, mask_zero=True),
        Bidirectional(LSTM(rnn_units, return_sequences=True, dropout=0.5)),
        Bidirectional(LSTM(rnn_units, dropout=0.25)),
        Dense(1, activation="softmax")
    ])
    return model
The embedding layer receives an embedding matrix with values from Word2Vec.
This is the code for the embedding matrix:
Embedding Matrix
def create_embedding_matrix(encoder, dict_w2v):
    embedding_dim = 50
    embedding_matrix = np.zeros((encoder.vocab_size, embedding_dim))
    for word in encoder.tokens:
        embedding_vector = dict_w2v.get(word)
        if embedding_vector is not None:  # dictionary contains the word
            token_id = encoder.encode(word)[0]
            embedding_matrix[token_id] = embedding_vector
    return embedding_matrix
Dataset
I'm using the amazon product dataset https://jmcauley.ucsd.edu/data/amazon/
This is what the dataframe looks like.
I'm only interested in overall and reviewText
overall is my Label and reviewText is my Feature
overall has a range of [1,5]
Problem
During training with categorical_crossentropy, the loss stays at 0.0000e+00. I don't think the loss can be minimized from there, so accuracy is stuck at 0.1172.
Did I configure my model wrong, or is there some other problem? How do I fix this loss issue? Please tell me if anything is unclear and I'll provide more information.
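One detail worth checking (a hypothesis, not a confirmed diagnosis): `Dense(1, activation="softmax")` normalizes over a single unit, so its output is always exactly 1 regardless of the input, which would make categorical cross-entropy exactly zero. A minimal sketch:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Softmax over a single unit is always 1, whatever the logit is:
for logit in (-5.0, 0.0, 3.2):
    assert softmax(np.array([logit]))[0] == 1.0

# With the prediction fixed at 1, categorical cross-entropy -sum(y*log(p)) vanishes:
y_true, y_pred = 1.0, 1.0
loss = -(y_true * np.log(y_pred))
assert loss == 0.0
```

If that is the cause, a single-unit output would normally use a `sigmoid` activation (binary case) or the layer would need one unit per class with softmax.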

Completely different results using Tensorflow and Pytorch for MobilenetV3 Small

I am using transfer learning from MobileNetV3 Small to predict 5 different points on an image. I am doing this as a regression task.
For both models:
Setting the last 50 layers trainable and adding the same fully connected layers to the end.
Learning rate 3e-2
Batch size 32
Adam optimizer with the same betas
100 epochs
The inputs consist of unscaled RGB images
Pytorch
Model
def _init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)

def get_mob_v3_small():
    model = torchvision.models.mobilenet_v3_small(pretrained=True)
    children_list = get_children(model)
    for c in children_list[:-50]:
        for p in c.parameters():
            p.requires_grad = False
    return model
class TransferMobileNetV3_v2(nn.Module):
    def __init__(self, num_keypoints: int = 5):
        super(TransferMobileNetV3_v2, self).__init__()
        self.classifier_neurons = num_keypoints * 2
        self.base_model = get_mob_v3_small()
        self.base_model.classifier = nn.Sequential(
            nn.Linear(in_features=1024, out_features=1024),
            nn.ReLU(),
            nn.Linear(in_features=1024, out_features=512),
            nn.ReLU(),
            nn.Linear(in_features=512, out_features=self.classifier_neurons)
        )
        self.base_model.apply(_init_weights)

    def forward(self, x):
        out = self.base_model(x)
        return out
Training Script
def train(net, trainloader, testloader, train_loss_fn, optimizer, scaler, args):
    len_dataloader = len(trainloader)
    for epoch in range(1, args.epochs + 1):
        net.train()
        for batch_idx, sample in enumerate(trainloader):
            inputs, labels = sample
            inputs, labels = inputs.to(args.device), labels.to(args.device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast(args.use_amp):
                prediction = net(inputs)
                loss = train_loss_fn(prediction, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
def main():
    args = make_args_parser()
    args.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    seed = args.seed
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    loss_fn = nn.MSELoss()
    optimizer = optim.Adam(net.parameters(), lr=3e-2,
                           betas=(0.9, 0.999))
    scaler = torch.cuda.amp.GradScaler(enabled=args.use_amp)
    train(net, train_loader, test_loader, loss_fn, optimizer, scaler, args)
Tensorflow
Model
base_model = tf.keras.applications.MobileNetV3Small(weights='imagenet',
                                                    input_shape=(224, 224, 3))
x_in = base_model.layers[-6].output
x = Dense(units=1024, activation="relu")(x_in)
x = Dense(units=512, activation="relu")(x)
x = Dense(units=10, activation="linear")(x)
model = Model(inputs=base_model.input, outputs=x)
for layer in model.layers[:-50]:
    layer.trainable = False
Training Script
model.compile(loss="mse",
              optimizer=tf.keras.optimizers.Adam(learning_rate=3e-2))
history = model.fit(input_numpy, output_numpy,
                    verbose=1,
                    batch_size=32, epochs=100, validation_split=0.2)
Results
The PyTorch model predicts a single point near the center for all 5 points.
The TensorFlow model predicts the points quite well, and its predictions are quite accurate.
The loss of the PyTorch model is much higher than that of the TensorFlow model.
Please do let me know what is going wrong, as I am trying my best to shift to PyTorch for this work and I need this model to give me similar/identical results.
Note: I also noticed that the MobileNetV3 Small architecture seems to differ between PyTorch and TensorFlow. I do not know if I am interpreting it wrong, but I'm noting it here just in case.

How to merge ReLU after quantization aware training

I have a network which contains Conv2D layers followed by ReLU activations, declared as such:
x = layers.Conv2D(self.hparams['channels_count'], kernel_size=(4,1))(x)
x = layers.ReLU()(x)
And it is ported to TFLite with the following representation:
Basic TFLite network without Q-aware training
However, after performing quantization-aware training on the network and porting it again, the ReLU layers are now explicit in the graph:
TFLite network after Q-aware training
This results in them being processed separately on the target instead of during the evaluation of the Conv2D kernel, inducing a 10% performance loss in my overall network.
Declaring the activation with the following implicit syntax does not produce the problem:
x = layers.Conv2D(self.hparams['channels_count'], kernel_size=(4,1), activation='relu')(x)
Basic TFLite network with implicit ReLU activation
TFLite network with implicit ReLU after Q-aware training
However, this restricts the network to basic ReLU activation, whereas I would like to use ReLU6 which cannot be declared in this way.
Is this a TFLite issue? If not, is there a way to prevent the ReLU layer from being split? Or alternatively, is there a way to manually merge the ReLU layers back into the Conv2D layers after the quantization-aware training?
Edit:
QA training code:
def learn_qaware(self):
    quantize_model = tfmot.quantization.keras.quantize_model
    self.model = quantize_model(self.model)

    training_generator = SCDataGenerator(self.training_set)
    validate_generator = SCDataGenerator(self.validate_set)

    self.model.compile(
        optimizer=self.configure_optimizers(qa_learn=True),
        loss=self.get_LLP_loss(),
        metrics=self.get_metrics(),
        run_eagerly=config['eager_mode'],
    )
    self.model.fit(
        training_generator,
        epochs=self.hparams['max_epochs'],
        batch_size=1,
        shuffle=self.hparams['shuffle_curves'],
        validation_data=validate_generator,
        callbacks=self.get_callbacks(qa_learn=True),
    )
Quantized TFLite model generation code:
def tflite_convert(classifier):
    output_file = get_tflite_filename(classifier.model_path)

    # Convert the model to the TensorFlow Lite format without quantization
    saved_shape = classifier.model.input.shape.as_list()
    fixed_shape = saved_shape
    fixed_shape[0] = 1
    classifier.model.input.set_shape(fixed_shape)  # Force batch size to 1 for generation
    converter = tf.lite.TFLiteConverter.from_keras_model(classifier.model)
    classifier.model.input.set_shape(saved_shape)

    # Set the optimization flag.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Enforce integer-only quantization
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    # Provide a representative dataset to ensure we quantize correctly.
    if config['eager_mode']:
        tf.executing_eagerly()

    def representative_dataset():
        for x in classifier.validate_set.get_all_inputs():
            rs = x.reshape(1, x.shape[0], 1, 1).astype(np.float32)
            yield [rs]

    converter.representative_dataset = representative_dataset
    model_tflite = converter.convert()

    # Save the model to disk
    with open(output_file, "wb") as f:
        f.write(model_tflite)
    return TFLite_model(output_file)
I have found a workaround which works by instantiating a non-trained version of the model, then copying over the weights from the quantization aware trained model before converting to TFLite.
This seems like quite a hack, so I'm still on the lookout for a cleaner solution.
Code for the workaround:
def dequantize(self):
    if not hasattr(self, 'fp_model') or not self.fp_model:
        self.fp_model = self.get_default_model()

    def find_layer_in_model(name, model):
        for layer in model.layers:
            if layer.name == name:
                return layer
        return None

    def find_weight_group_in_layer(name, layer):
        for weight_group in layer.trainable_weights:
            if weight_group.name == name:
                return weight_group
        return None

    QUANT_TAG = "quant_"
    for layer in self.fp_model.layers:
        if 'input' in layer.name or 'quantize_layer' in layer.name:
            continue
        quant_layer = find_layer_in_model(QUANT_TAG + layer.name, self.model)
        if quant_layer is None:
            raise RuntimeError('Failed to match layer ' + layer.name)
        for i, weight_group in enumerate(layer.trainable_weights):
            quant_weight_group = find_weight_group_in_layer(QUANT_TAG + weight_group.name, quant_layer)
            if quant_weight_group is None:
                quant_weight_group = find_weight_group_in_layer(weight_group.name, quant_layer)
            if quant_weight_group is None:
                raise RuntimeError('Failed to match weight group ' + weight_group.name)
            layer.trainable_weights[i].assign(quant_weight_group)

    self.model = self.fp_model
You can pass activation=tf.nn.relu6 to use ReLU6 activation.
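For example, a minimal sketch (the filter count and input shape here are placeholders, not the original network's):

```python
import tensorflow as tf
from tensorflow.keras import layers

# ReLU6 passed as the fused activation of the Conv2D layer, so the
# converter should not need to emit a separate ReLU op in the graph.
inputs = tf.keras.Input(shape=(32, 1, 1))
x = layers.Conv2D(16, kernel_size=(4, 1), activation=tf.nn.relu6)(inputs)
model = tf.keras.Model(inputs, x)
```

Any callable accepted by Keras can be used this way, so the implicit-activation syntax is not limited to the string names like `'relu'`.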

How to apply Triplet Loss for a ResNet50 based Siamese Network in Keras or Tf 2

I have a ResNet-based Siamese network which uses the idea that you try to minimize the L2 distance between two images and then apply a sigmoid so that it gives you a {0: 'same', 1: 'different'} output. Based on how far off the prediction is, you flow the gradients back through the network. The problem is that the gradient updates are too small, since we're squashing the distance into {0, 1}, so I thought of using the same architecture but based on triplet loss.
I1 = Input(shape=image_shape)
I2 = Input(shape=image_shape)
res_m_1 = ResNet50(include_top=False, weights='imagenet', input_tensor=I1, pooling='avg')
res_m_2 = ResNet50(include_top=False, weights='imagenet', input_tensor=I2, pooling='avg')
x1 = res_m_1.output
x2 = res_m_2.output
# x = Flatten()(x)  # use this instead if not using any pooling layer
distance = Lambda(lambda tensors: K.abs(tensors[0] - tensors[1]))([x1, x2])
final_output = Dense(1, activation='sigmoid')(distance)
siamese_model = Model(inputs=[I1, I2], outputs=final_output)
siamese_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['acc'])
siamese_model.fit_generator(train_gen, steps_per_epoch=1000, epochs=10, validation_data=validation_data)
So how can I change it to use the Triplet Loss function? What adjustments should be done here in order to get this done? One change will be that I'll have to calculate
res_m_3 = ResNet50(include_top=False, weights='imagenet', input_tensor=I2, pooling='avg')
x3 = res_m_3.output
One thing I found in the TF docs is triplet-semi-hard-loss, given as:
tfa.losses.TripletSemiHardLoss()
As shown in the paper, the best results are from triplets known as "Semi-Hard". These are defined as triplets where the negative is farther from the anchor than the positive, but still produces a positive loss. To efficiently find these triplets we utilize online learning and only train from the Semi-Hard examples in each batch.
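To make the "semi-hard" definition concrete, here is a small sketch with illustrative distances (the numbers are made up, not from the paper):

```python
# For an anchor-positive distance d_ap and margin m, a negative is
# "semi-hard" when d_ap < d_an < d_ap + m: the negative is farther from
# the anchor than the positive, yet the loss max(d_ap - d_an + m, 0)
# is still positive.
d_ap = 0.8      # anchor-positive distance (illustrative)
margin = 0.2

for d_an, expected in [(0.5, False),   # "hard": negative closer than positive
                       (0.9, True),    # semi-hard: farther, but within the margin
                       (1.5, False)]:  # "easy": loss is already zero
    semi_hard = d_ap < d_an < d_ap + margin
    assert semi_hard == expected
```

For the semi-hard case above, the triplet loss is max(0.8 - 0.9 + 0.2, 0) = 0.1, i.e. small but nonzero, which is exactly what makes these triplets useful for training.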
Another implementation of Triplet Loss which I found on Kaggle is: Triplet Loss Keras
Which one should I use and most importantly, HOW?
P.S: People also use something like: x = Lambda(lambda x: K.l2_normalize(x,axis=1))(x) after model.output. Why is that? What is this doing?
Following this answer of mine, and with the role of TripletSemiHardLoss in mind, we could do the following:
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow.keras import models, layers

BATCH_SIZE = 32
LATENT_DIM = 128

def _normalize_img(img, label):
    img = tf.cast(img, tf.float32) / 255.
    return (img, label)

train_dataset, test_dataset = tfds.load(name="mnist", split=['train', 'test'], as_supervised=True)

# Build your input pipelines
train_dataset = train_dataset.shuffle(1024).batch(BATCH_SIZE)
train_dataset = train_dataset.map(_normalize_img)
test_dataset = test_dataset.batch(BATCH_SIZE)
test_dataset = test_dataset.map(_normalize_img)

inputs = layers.Input(shape=(28, 28, 1))
resNet50 = tf.keras.applications.ResNet50(include_top=False, weights=None, input_tensor=inputs, pooling='avg')
outputs = layers.Dense(LATENT_DIM, activation=None)(resNet50.output)  # No activation on final dense layer
outputs = layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(outputs)  # L2-normalize embeddings
siamese_model = models.Model(inputs=inputs, outputs=outputs)

# Compile the model
siamese_model.compile(
    optimizer=tf.keras.optimizers.Adam(0.001),
    loss=tfa.losses.TripletSemiHardLoss())

# Train the network
history = siamese_model.fit(
    train_dataset,
    epochs=3)
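As for the P.S. about `K.l2_normalize`: it projects each embedding onto the unit hypersphere, so distances between embeddings depend only on their direction, not their magnitude, which keeps the triplet distances bounded and comparable across samples. A minimal NumPy sketch with illustrative vectors:

```python
import numpy as np

# Two embeddings pointing in the same direction but with different norms:
emb = np.array([[3.0, 4.0],
                [0.3, 0.4]])
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)

print(normed)                           # both rows become [0.6, 0.8]
print(np.linalg.norm(normed, axis=1))   # every embedding now has norm 1
```

Without the normalization, the network could shrink or inflate the embedding magnitudes to game the distance-based loss; on the unit sphere, only the angular arrangement of the embeddings matters.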

Variational Autoencoder in Keras: How to achieve different output of a Keras Layer at the time of training and prediction?

We're implementing a paper titled - "Variational Autoencoders for Collaborative Filtering" in TF 2.0.
The sample implementation of the above paper in TF 1.0 is given here.
The paper proposes an implementation of a Variational Autoencoder for collaborative filtering. As the output of the encoder, it uses the reparametrization trick to sample the latent vector Z at the time of training the network.
The reparametrization trick samples ε ∼ N(0, I_K) and reparametrizes the latent vector Z as:
z_u = μ_φ(x_u) + ε ⊙ σ_φ(x_u), where μ_φ and σ_φ are computed from the output of the encoder.
But, at the time of prediction, the paper proposes to use only µϕ for sampling Z.
In our implementation, we used a custom tf.keras.layers.Layer to sample the latent vector Z. The following is the code of the architecture:
class Reparameterize(tf.keras.layers.Layer):
    """
    Custom layer.
    Reparameterization trick: sample random latent vectors Z from
    the latent Gaussian distribution.
    The sampled vector Z is given by
        sampled_z = mean + std * epsilon
    """
    def call(self, inputs):
        Z_mu, Z_logvar = inputs
        Z_sigma = tf.math.exp(0.5 * Z_logvar)
        epsilon = tf.random.normal(tf.shape(Z_sigma))
        return Z_mu + Z_sigma * epsilon


class VAE:
    def __init__(self, input_dim, latent_dim=200):
        # encoder
        encoder_input = Input(shape=input_dim)
        X = tf.math.l2_normalize(encoder_input, 1)
        X = Dropout(0.5)(X)
        X = Dense(600, activation='tanh')(X)
        Z_mu = Dense(latent_dim)(X)
        Z_logvar = Dense(latent_dim)(X)
        sampled_Z = Reparameterize()([Z_mu, Z_logvar])

        # decoder
        decoder_input = Input(shape=latent_dim)
        X = Dense(600, activation='tanh')(decoder_input)
        logits = Dense(input_dim)(X)

        # define losses
        """
        custom loss function
        def loss(X_true, X_pred)
        """

        # create models
        self.encoder = Model(encoder_input, [Z_logvar, Z_mu, sampled_Z], name='encoder')
        self.decoder = Model(decoder_input, logits, name='decoder')
        self.vae = Model(encoder_input, self.decoder(sampled_Z), name='vae')
        self.vae.add_loss(kl_divergence(Z_logvar, Z_mu))

        # compile the model
        self.vae.compile(optimizer='adam', loss=loss, metrics=[loss])
Now, I am looking for a way to change the implementation of the custom Reparameterize layer at the time of prediction to use only µϕ (Z_mu) for sampling Z so as to achieve what is proposed by the paper mentioned above.
Or if there's another way of doing so in Tf 2.0, kindly recommend.
You could do:
# create your VAE model
my_vae = VAE(input_dim = my_input_dim)
# Train it as you wish
# .....
When training is done, you could use it as follows:
inp = Input(shape=my_input_dim)
_, Z_mu, _ = my_vae.encoder(inp)            # my_vae is your trained model; get its outputs
decoder_output = my_vae.decoder(Z_mu)       # use Z_mu as the input to the decoder
vae_predictor = Model(inp, decoder_output)  # create your prediction-time model
You could use the vae_predictor model now for predictions.
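Another option, a sketch of the standard Keras pattern (not tested against your full model): give the custom layer a `training` argument so Keras routes `fit()` through the sampling branch and `predict()` through the deterministic one, with no separate prediction model needed.

```python
import tensorflow as tf

class Reparameterize(tf.keras.layers.Layer):
    """Sample Z during training; return the mean Z_mu at inference time."""
    def call(self, inputs, training=None):
        Z_mu, Z_logvar = inputs
        if training:
            Z_sigma = tf.math.exp(0.5 * Z_logvar)
            epsilon = tf.random.normal(tf.shape(Z_sigma))
            return Z_mu + Z_sigma * epsilon
        return Z_mu  # deterministic output for prediction
```

Keras passes `training=True` during `fit` and `training=False` during `predict`/`evaluate`, so the same model object behaves as the paper proposes in both phases.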