How to merge ReLU after quantization aware training - tensorflow

I have a network which contains Conv2D layers followed by ReLU activations, declared as such:
x = layers.Conv2D(self.hparams['channels_count'], kernel_size=(4,1))(x)
x = layers.ReLU()(x)
And it is ported to TFLite with the following representaiton:
Basic TFLite network without Q-aware training
However, after performing quantization-aware training on the network and porting it again, the ReLU layers are now explicit in the graph:
TFLite network after Q-aware training
This results in them being processed separately on the target instead of during the evaluation of the Conv2D kernel, inducing a 10% performance loss in my overall network.
Declaring the activation with the following implicit syntax does not produce the problem:
x = layers.Conv2D(self.hparams['channels_count'], kernel_size=(4,1), activation='relu')(x)
Basic TFLite network with implicit ReLU activation
TFLite network with implicit ReLU after Q-aware training
However, this restricts the network to basic ReLU activation, whereas I would like to use ReLU6 which cannot be declared in this way.
Is this a TFLite issue? If not, is there a way to prevent the ReLU layer from being split? Or alternatively, is there a way to manually merge the ReLU layers back into the Conv2D layers after the quantization-aware training?
Edit:
QA training code:
def learn_qaware(self):
quantize_model = tfmot.quantization.keras.quantize_model
self.model = quantize_model(self.model)
training_generator = SCDataGenerator(self.training_set)
validate_generator = SCDataGenerator(self.validate_set)
self.model.compile(
optimizer=self.configure_optimizers(qa_learn=True),
loss=self.get_LLP_loss(),
metrics=self.get_metrics(),
run_eagerly=config['eager_mode'],
)
self.model.fit(
training_generator,
epochs = self.hparams['max_epochs'],
batch_size = 1,
shuffle = self.hparams['shuffle_curves'],
validation_data = validate_generator,
callbacks = self.get_callbacks(qa_learn=True),
)
Quantized TFLite model generation code:
def tflite_convert(classifier):
output_file = get_tflite_filename(classifier.model_path)
# Convert the model to the TensorFlow Lite format without quantization
saved_shape = classifier.model.input.shape.as_list()
fixed_shape = saved_shape
fixed_shape[0] = 1
classifier.model.input.set_shape(fixed_shape) # Force batch size to 1 for generation
converter = tf.lite.TFLiteConverter.from_keras_model(classifier.model)
classifier.model.input.set_shape(saved_shape)
# Set the optimization flag.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# Enforce integer only quantization
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
# Provide a representative dataset to ensure we quantize correctly.
if config['eager_mode']:
tf.executing_eagerly()
def representative_dataset():
for x in classifier.validate_set.get_all_inputs():
rs = x.reshape(1, x.shape[0], 1, 1).astype(np.float32)
yield([rs])
converter.representative_dataset = representative_dataset
model_tflite = converter.convert()
# Save the model to disk
open(output_file, "wb").write(model_tflite)
return TFLite_model(output_file)

I have found a workaround which works by instantiating a non-trained version of the model, then copying over the weights from the quantization aware trained model before converting to TFLite.
This seems like quite a hack, so I'm still on the lookout for a cleaner solution.
Code for the workaround:
def dequantize(self):
if not hasattr(self, 'fp_model') or not self.fp_model:
self.fp_model = self.get_default_model()
def find_layer_in_model(name, model):
for layer in model.layers:
if layer.name == name:
return layer
return None
def find_weight_group_in_layer(name, layer):
for weight_group in quant_layer.trainable_weights:
if weight_group.name == name:
return weight_group
return None
for layer in self.fp_model.layers:
if 'input' in layer.name or 'quantize_layer' in layer.name:
continue
QUANT_TAG = "quant_"
quant_layer = find_layer_in_model(QUANT_TAG+layer.name,self.model)
if quant_layer is None:
raise RuntimeError('Failed to match layer ' + layer.name)
for i, weight_group in enumerate(layer.trainable_weights):
quant_weight_group = find_weight_group_in_layer(QUANT_TAG+weight_group.name, quant_layer)
if quant_weight_group is None:
quant_weight_group = find_weight_group_in_layer(weight_group.name, quant_layer)
if quant_weight_group is None:
raise RuntimeError('Failed to match weight group ' + weight_group.name)
layer.trainable_weights[i].assign(quant_weight_group)
self.model = self.fp_model

You can pass activation=tf.nn.relu6 to use ReLU6 activation.

Related

Output logits with softmax aren't extreme when using Tensorflow. No prediction is very confident

I trained a text classification model on data in TensorFlow and plotted the SoftMax confidence for the correct prediction as well as the SoftMax confidence for incorrect predictions. When I did this I noticed that there were no output predictions with a high logit/class probability. For example, predicting between 4 classes had these results:
(TensorFlow version)
input text: "Text that fits into class 0"
logits: [.3928, 0.2365, 0.1854, 0.1854]
label: class 0
I would hope that the logit output for class 0 would be higher than .3928! Looking at the graph you can see that none of the prediction logits output a number higher than (.5).
Next, I retrained the exact same model and dataset but in Pytorch. With Pytorch, I got the results I was looking for. Both models had the exact same validation accuracy after training. (90% val accuracy)
(Pytorch Version)
input text: "Text that fits into class 0"
logits: [0.8778, 0.0532, 0.0056, 0.0635]
label: class 0
Here is what I understand about the softmax function:
The softmax function transforms its inputs, which can be seen as logits, into a probability distribution over the classes. The maximum value of the softmax is 1, so if your largest logit is 0.5, then it means that the highest probability assigned by the softmax will be relatively low (less than 1/2).
To have more extreme outputs, you need to have larger logits. One way to do this is to train your model with more data and a more powerful architecture, so that it can learn more complex relationships between inputs and outputs. It may not be desirable for a model to have extremely high confidence in its predictions, as it could lead to overfitting to the training data. The appropriate level of confidence will depend on the specific use case and the desired trade-off between precision and recall.
My Tensorflow Model:
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny', from_pt = True)
encoder = TFAutoModel.from_pretrained('prajjwal1/bert-tiny', from_pt = True)
# Define input layer with token and attention mask
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="attention_mask")
# Call the ALBERT model with the inputs
pooler_output = encoder(input_ids, attention_mask=attention_mask)[1] # 1 is pooler output
# Define a dense layer on top of the pooled output
x = tf.keras.layers.Dense(units=params['fc_layer_size'])(pooler_output)
x = tf.keras.layers.Dropout(params['dropout'])(x)
outputs = tf.keras.layers.Dense(4, activation='softmax')(x)
# Define a model with the inputs and dense layer
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.0008)
metrics = [tf.metrics.SparseCategoricalAccuracy()]
# Compile the model
model.compile(optimizer='sgd', loss=loss, metrics=metrics)
My Pytorch Model:
tokenizer = AutoTokenizer.from_pretrained('prajjwal1/bert-tiny')
encoder = AutoModel.from_pretrained('prajjwal1/bert-tiny')
loss_fn = nn.CrossEntropyLoss()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.0008)
class EscalationClassifier(nn.Module):
def __init__(self, encoder):
super(EscalationClassifier, self).__init__()
self.encoder = encoder
self.fc1 = nn.Linear(128, 312)
self.fc2 = nn.Linear(312, 4)
self.dropout = nn.Dropout(0.2)
def forward(self, input_ids, attention_mask):
pooled_output = self.encoder(input_ids, attention_mask=attention_mask)[1]# [0] is last hidden state, 1 for pooler output
# pdb.set_trace()
x = self.fc1(pooled_output)
x = self.dropout(x)
x = self.fc2(x)
return x
model = EscalationClassifier(encoder)
Can anyone help me explain why my Tensorflow logit outputs aren't more confident like the pytorch outputs? *The problem doesn't seem to be with the softmax itself.

How to train Tensorflow's pre trained BERT on MLM task? ( Use pre-trained model only in Tensorflow)

Before Anyone suggests pytorch and other things, I am looking specifically for Tensorflow + pretrained + MLM task only. I know, there are lots of blogs for PyTorch and lots of blogs for fine tuning ( Classification) on Tensorflow.
Coming to the problem, I got a language model which is English + LaTex where a text data can represent any text from Physics, Chemistry, MAths and Biology and any typical example can look something like this:
Link to OCR image
"Find the value of function x in the equation: \n \\( f(x)=\\left\\{\\begin{array}{ll}x^{2} & \\text { if } x<0 \\\\ 2 x & \\text { if } x \\geq 0\\end{array}\\right. \\)"
So my language model needs to understand \geq \\begin array \eng \left \right other than the English language and that is why I need to train an MLM first on pre-trained BERT or SciBERT to have both. So I went up digging the internet and found some tutorials:
MLM training on Tensorflow BUT from Scratch; I need pre-trained
MLM on pre-trained but in Pytorch; I need Tensorflow
Fine Tuning with Keras; It is for classification but I want MLM
I already have a fine tuning classification model. Some of the code is as follows:
tokenizer = transformers.BertTokenizer.from_pretrained('bert-large-uncased')
def regular_encode(texts, tokenizer, maxlen=maxlen):
enc_di = tokenizer.batch_encode_plus(texts, return_token_type_ids=False,padding='max_length',max_length=maxlen,truncation = True,)
return np.array(enc_di['input_ids'])
Xtrain_encoded = regular_encode(X_train.astype('str'), tokenizer, maxlen=maxlen)
ytrain_encoded = tf.keras.utils.to_categorical(y_train, num_classes=classes,dtype = 'int32')
def build_model(transformer, loss='categorical_crossentropy', max_len=maxlen, dense = 512, drop1 = 0.3, drop2 = 0.3):
input_word_ids = tf.keras.layers.Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
sequence_output = transformer(input_word_ids)[0]
cls_token = sequence_output[:, 0, :]
#Fine Tuning Model Start
x = tf.keras.layers.Dropout(drop1)(cls_token)
x = tf.keras.layers.Dense(512, activation='relu')(x)
x = tf.keras.layers.Dropout(drop2)(x)
out = tf.keras.layers.Dense(classes, activation='softmax')(x)
model = tf.keras.Model(inputs=input_word_ids, outputs=out)
return model
Only useful thing I could get was in the HuggingFace that
With the tight interoperability between TensorFlow and PyTorch models, you can even save the model and then reload it as a PyTorch model (or vice-versa)
from transformers import AutoModelForSequenceClassification
model.save_pretrained("my_imdb_model")
pytorch_model = AutoModelForSequenceClassification.from_pretrained("my_imdb_model", from_tf=True)
So maybe I could train a pytorch MLM and then load it as a tensorflow fine tuned classification model?
Is there any other way?
I think after some research, I found something. I don't know if it'll work but it uses transformer and Tensorflow with XLM . I think it'll work for BERT too.
PRETRAINED_MODEL = 'jplu/tf-xlm-roberta-large'
from tensorflow.keras.optimizers import Adam
import transformers
from transformers import TFAutoModelWithLMHead, AutoTokenizer
def create_mlm_model_and_optimizer():
with strategy.scope():
model = TFAutoModelWithLMHead.from_pretrained(PRETRAINED_MODEL)
optimizer = tf.keras.optimizers.Adam(learning_rate=LR)
return model, optimizer
mlm_model, optimizer = create_mlm_model_and_optimizer()
def define_mlm_loss_and_metrics():
with strategy.scope():
mlm_loss_object = masked_sparse_categorical_crossentropy
def compute_mlm_loss(labels, predictions):
per_example_loss = mlm_loss_object(labels, predictions)
loss = tf.nn.compute_average_loss(
per_example_loss, global_batch_size = global_batch_size)
return loss
train_mlm_loss_metric = tf.keras.metrics.Mean()
return compute_mlm_loss, train_mlm_loss_metric
def masked_sparse_categorical_crossentropy(y_true, y_pred):
y_true_masked = tf.boolean_mask(y_true, tf.not_equal(y_true, -1))
y_pred_masked = tf.boolean_mask(y_pred, tf.not_equal(y_true, -1))
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true_masked,
y_pred_masked,
from_logits=True)
return loss
def train_mlm(train_dist_dataset, total_steps=2000, evaluate_every=200):
step = 0
### Training lopp ###
for tensor in train_dist_dataset:
distributed_mlm_train_step(tensor)
step+=1
if (step % evaluate_every == 0):
### Print train metrics ###
train_metric = train_mlm_loss_metric.result().numpy()
print("Step %d, train loss: %.2f" % (step, train_metric))
### Reset metrics ###
train_mlm_loss_metric.reset_states()
if step == total_steps:
break
#tf.function
def distributed_mlm_train_step(data):
strategy.experimental_run_v2(mlm_train_step, args=(data,))
#tf.function
def mlm_train_step(inputs):
features, labels = inputs
with tf.GradientTape() as tape:
predictions = mlm_model(features, training=True)[0]
loss = compute_mlm_loss(labels, predictions)
gradients = tape.gradient(loss, mlm_model.trainable_variables)
optimizer.apply_gradients(zip(gradients, mlm_model.trainable_variables))
train_mlm_loss_metric.update_state(loss)
compute_mlm_loss, train_mlm_loss_metric = define_mlm_loss_and_metrics()
Now train it as train_mlm(train_dist_dataset, TOTAL_STEPS, EVALUATE_EVERY)
Above ode is from this notebook and you need to do all the necessary things given exactly
The author says in the end that:
This fine tuned model can be loaded just as the original to build a classification model from it

How to save Tensorflow encoder decoder model?

I followed this tutorial about building an encoder-decoder language translation model and built one for my native language.
Now I want to save it, deploy on cloud ML engine and make predictions with HTTP request.
I couldn't find a clear example on how to save this model,
I am new to ML and found TF save guide v confusing..
Is there a way to save this model using something like
tf.keras.models.save_model
Create the train saver after opening the session and after the training is done save the model:
with tf.Session() as sess:
saver = tf.train.Saver()
# Training of the model
save_path = saver.save(sess, "logs/encoder_decoder")
print(f"Model saved in path {save_path}")
You can save a Keras model in Keras's HDF5 format, see:
https://keras.io/getting-started/faq/#how-can-i-save-a-keras-model
You will want to do something like:
import tf.keras
model = tf.keras.Model(blah blah)
model.save('my_model.h5')
If you migrate to TF 2.0, it's more straightforward to build a model in tf.keras and deploy using the TF SavedModel format. This 2.0 tutorial shows using a pretrained tf.keras model, saving the model in SavedModel format, deploying to the cloud and then doing an HTTP request for a prediction:
https://www.tensorflow.org/beta/guide/saved_model
I know I am a little late but was having the same problem (see How do I save an encoder-decoder model with TensorFlow? for more details) and figured out a solution. It's a little hacky, but it works!
Step 1 - Saving your model
Save your tokenizer (if applicable). Then individually save the weights of the model you used to train your data (naming your layers helps here).
# Save the tokenizer
with open('tokenizer.pickle', 'wb') as handle:
pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
# save the weights individually
for layer in model.layers:
weights = layer.get_weights()
if weights != []:
np.savez(f'{layer.name}.npz', weights)
Step 2 - reloading the weights
You will want to reload the tokenizer (as applicable) then load the weights you just saved. The loaded weights are in an npz format so can't be used directly, but the very short documentation will tell you everything you need to know about this file type https://numpy.org/doc/stable/reference/generated/numpy.savez.html
# load the tokenizer
with open('tokenizer.pickle', 'rb') as handle:
tokenizer = pickle.load(handle)
# load the weights
w_encoder_embeddings = np.load('encoder_embeddings.npz', allow_pickle=True)
w_decoder_embeddings = np.load('decoder_embeddings.npz', allow_pickle=True)
w_encoder_lstm = np.load('encoder_lstm.npz', allow_pickle=True)
w_decoder_lstm = np.load('decoder_lstm.npz', allow_pickle=True)
w_dense = np.load('dense.npz', allow_pickle=True)
Step 3 - Recreate the your training model and apply the weights
You'll want to re-run the code you used to create your model. In my case this was:
encoder_inputs = Input(shape=(None,), name="encoder_inputs")
encoder_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True, name="encoder_embeddings")(encoder_inputs)
# Encoder lstm
encoder_lstm = LSTM(512, return_state=True, name="encoder_lstm")
encoder_outputs, state_h, state_c = encoder_lstm(encoder_embeddings)
# discard `encoder_outputs` and only keep the states.
encoder_states = [state_h, state_c]
# Set up the decoder, using `encoder_states` as initial state.
decoder_inputs = Input(shape=(None,), name="decoder_inputs")
# target word embeddings
decoder_embeddings = Embedding(vocab_size, embedding_size, mask_zero=True, name="decoder_embeddings")
training_decoder_embeddings = decoder_embeddings(decoder_inputs)
# decoder lstm
decoder_lstm = LSTM(512, return_sequences=True, return_state=True, name="decoder_lstm")
decoder_outputs, _, _ = decoder_lstm(training_decoder_embeddings,
initial_state=encoder_states)
decoder_dense = TimeDistributed(Dense(vocab_size, activation='softmax'), name="dense")
decoder_outputs = decoder_dense(decoder_outputs)
# While training, model takes input and traget words and outputs target strings
loaded_model = Model([encoder_inputs, decoder_inputs], decoder_outputs, name="training_model")
Now you can apply your saved weights to these layers! It takes a little bit of investigation which weight goes to which layer, but this is made a lot easier by naming your layers and inspecting your model layers with model.layers.
# set the weights of the model
loaded_model.layers[2].set_weights(w_encoder_embeddings['arr_0'])
loaded_model.layers[3].set_weights(w_decoder_embeddings['arr_0'])
loaded_model.layers[4].set_weights(w_encoder_lstm['arr_0'])
loaded_model.layers[5].set_weights(w_decoder_lstm['arr_0'])
loaded_model.layers[6].set_weights(w_dense['arr_0'])
Step 4 - Create the inference model
Finally, you can now create your inference model based on this training model! Again in my case this was:
encoder_model = Model(encoder_inputs, encoder_states)
# Redefine the decoder model with decoder will be getting below inputs from encoder while in prediction
decoder_state_input_h = Input(shape=(512,))
decoder_state_input_c = Input(shape=(512,))
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
inference_decoder_embeddings = decoder_embeddings(decoder_inputs)
decoder_outputs2, state_h2, state_c2 = decoder_lstm(inference_decoder_embeddings, initial_state=decoder_states_inputs)
decoder_states2 = [state_h2, state_c2]
decoder_outputs2 = decoder_dense(decoder_outputs2)
# sampling model will take encoder states and decoder_input(seed initially) and output the predictions(french word index) We dont care about decoder_states2
decoder_model = Model(
[decoder_inputs] + decoder_states_inputs,
[decoder_outputs2] + decoder_states2)
And voilĂ ! You can now make inferences using the previously trained model!

Tensorflow dense layers worse than keras sequential

I try to train an agent on the inverse-pendulum (similar to cart-pole) problem, which is a benchmark of reinforcement learning. I use neural-fitted-Q-iteration algorithm which uses a multi-layer neural network to evaluate the Q function.
I use Keras.Sequential and tf.layers.dense to build the neural network repectively, and leave all other things to be the same. However, Keras gives me a good results and tensorflow does not. In fact, tensorflow doesn't work at all with its loss being increasing and the agent learns nothing from the training.
Here I present the code for Keras as follows
def build_model():
model = Sequential()
model.add(Dense(5, input_dim=3))
model.add(Activation('sigmoid'))
model.add(Dense(5))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
adam = Adam(lr=1E-3)
model.compile(loss='mean_squared_error', optimizer=adam)
return model
and the tensorflow version is
class NFQ_fit(object):
"""
neural network approximator for NFQ iteration
"""
def __init__(self, sess, N_feature, learning_rate=1E-3, batch_size=100):
self.sess = sess
self.N_feature = N_feature
self.learning_rate = learning_rate
self.batch_size = batch_size
# DNN structure
self.inputs = tf.placeholder(tf.float32, [None, N_feature], 'inputs')
self.labels = tf.placeholder(tf.float32, [None, 1], 'labels')
self.l1 = tf.layers.dense(inputs=self.inputs,
units=5,
activation=tf.sigmoid,
use_bias=True,
kernel_initializer=tf.truncated_normal_initializer(0.0, 1E-2),
bias_initializer=tf.constant_initializer(0.0),
kernel_regularizer=tf.contrib.layers.l2_regularizer(1E-4),
name='hidden-layer-1')
self.l2 = tf.layers.dense(inputs=self.l1,
units=5,
activation=tf.sigmoid,
use_bias=True,
kernel_initializer=tf.truncated_normal_initializer(0.0, 1E-2),
bias_initializer=tf.constant_initializer(0.0),
kernel_regularizer=tf.contrib.layers.l2_regularizer(1E-4),
name='hidden-layer-2')
self.outputs = tf.layers.dense(inputs=self.l2,
units=1,
activation=tf.sigmoid,
use_bias=True,
kernel_initializer=tf.truncated_normal_initializer(0.0, 1E-2),
bias_initializer=tf.constant_initializer(0.0),
kernel_regularizer=tf.contrib.layers.l2_regularizer(1E-4),
name='outputs')
# optimization
# self.mean_loss = tf.losses.mean_squared_error(self.labels, self.outputs)
self.mean_loss = tf.reduce_mean(tf.square(self.labels-self.outputs))
self.regularization_loss = tf.losses.get_regularization_loss()
self.loss = self.mean_loss # + self.regularization_loss
self.train_op = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(self.loss)
The two models are the same. Both of them has two hidden layers with the same dimension. I expect that the problems may come from the kernel initialization but I don't know how to fix it.
Using Keras is great. If you want better TensorFlow integration check out tf.keras. There's no particular reason to use tf.layers if the Keras (or tf.keras) defaults work better.
In this case glorot_uniform looks like the default initializer. This is also the global TensorFlow default, so consider removing the kernel_initializer argument instead of the explicit truncated normal initialization in your question (or passing Glorot explicitly).

ValueError: No gradients provided for any variable when tensorflow operations added on keras output

I have a pre-trained Keras Sequential model called agent, and I'm trying to fine-tune it with a loss function.
json_file = open('model/prior_model_RMSprop.json', 'r')
json_model = json_file.read()
json_file.close()
agent = model_from_json(json_model)
prior = model_from_json(json_model)
# load weights into model
agent.load_weights('model/model_RMSprop.h5')
prior.load_weights('model/model_RMSprop.h5')
agent_output = agent.output
prior_output = prior.output
loss = tf.reduce_mean(tf.square(agent_output - prior_output))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
So far, everything works fine. However, when I add some basic tensorflow operations, the error happens
agent_logits = tf.cast(tf.argmax(agent_output, axis = 2), dtype = tf.float32)
prior_logits = tf.cast(tf.argmax(prior_output, axis = 2), dtype = tf.float32)
loss = tf.reduce_mean(tf.square(agent_logits - prior_logits))
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
ValueError: No gradients provided for any variable
So the tensorflow operations break the connection between the model and the loss function? I've been stucked here for almost 2 weeks so pls help. I'm also not very clear about how to update a Keras model's trainable weights with the loss function I defined. Any hints or related links will be appreciated!!!