Extremely slow when saving model on Colab TPU - tensorflow

Saving a model is extremely slow in the Colab TPU environment.
I first ran into this when using a checkpoint callback, which left training stuck at the end of the first epoch.
I then removed the callback and saved the model with model.save_weights() instead, but nothing changed. Using the Colab terminal, I found that the file is written at only about 100k per 5 minutes.
TensorFlow version: 2.3
My model-fitting code is here:
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()
    checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                                 save_weights_only=True, save_freq="epoch")
    hist = model.fit(get_train_ds().repeat(),
                     steps_per_epoch=100,
                     epochs=5,
                     verbose=1,
                     callbacks=[checkpoint])
    model.save_weights("epoch-test.h5", overwrite=True)

I found that the issue occurred because I had explicitly switched to graph mode by writing
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
before
with tpu_strategy.scope():
    model.fit(...)
I still don't understand the cause, but removing the disable_eager_execution() call solved the issue.

KeyError: 'The optimizer cannot recognize variable dense_1/kernel:0' for pretrained Keras model VGG19

I'm using the following code to load an ImageNet pre-trained VGG19 model and fit it to my custom dataset.
from keras.applications.vgg19 import VGG19
optim = tf.keras.optimizers.RMSprop(momentum=0.9)
vgg19 = VGG19(include_top=False, weights='imagenet', input_tensor=tf.keras.layers.Input(shape=(224, 224, 3)))
vgg19.trainable = False
# x = keras.layers.GlobalAveragePooling2D()(model_vgg19_pt.output)
x = keras.layers.Flatten()(vgg19.output)
output = keras.layers.Dense(n_classes, activation='softmax')(x)
model_vgg19_pt = keras.models.Model(inputs=[vgg19.input], outputs=[output])
model_vgg19_pt.compile(optimizer=optim,
                       loss='categorical_crossentropy', metrics=['categorical_accuracy'])
callback = tf.keras.callbacks.LearningRateScheduler(scheduler)
model_vgg19_pt.fit(x_train, y_train, batch_size=20,
                   epochs=50, callbacks=[callback]
                   )
On the model.fit() line, I get the following error:
KeyError: 'The optimizer cannot recognize variable dense_1/kernel:0. This usually means you are trying to call the optimizer to update different parts of the model separately. Please call optimizer.build(variables) with the full list of trainable variables before the training loop or use legacy optimizer `tf.keras.optimizers.legacy.{self.__class__.__name__}.'
What does it mean and how can I fix it?
I get the same error for keras.applications.inception_v3 too, when using the same implementation method.
Additionally, this worked in a Jupyter notebook on CPU-only TensorFlow, but on a remote machine with tensorflow-gpu installed I get these errors.
It works fine with the SGD optimizer, but not with RMSprop. Why?
Additional
Using this:
model_vgg19_pt.compile(optimizer=tf.keras.optimizers.RMSprop(momentum=0.9),
                       loss='categorical_crossentropy', metrics=['categorical_accuracy'])
instead of the compile call above works. Can somebody explain why?
Which version of TensorFlow GPU have you installed? TensorFlow 2.10 was the last TensorFlow release that supported GPU on native Windows. Please check the link to install TensorFlow by following all the hardware/software requirements for GPU support.
The scheduler function that you pass to the LearningRateScheduler callback is not defined anywhere in the posted code.
I was able to train the model after removing the callback from model.fit(). (Attaching the gist here for your reference.)
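A possible explanation for the follow-up observation (a fresh RMSprop created inside compile() works, the shared optim object does not): the newer Keras optimizers build their per-variable state the first time they see a model's variables, so an optimizer instance that was already used with another model in the same session would not recognize the new Dense layer's variables. SGD without momentum keeps no such per-variable state, which would be consistent with it working. The sketch below is a toy, framework-free illustration of that behaviour. ToyOptimizer is a hypothetical stand-in, not Keras code, and the assumption that optim was reused across models is not confirmed by the question.

```python
class ToyOptimizer:
    """Hypothetical stand-in: keeps one state slot per variable, fixed at first use."""

    def __init__(self):
        self._slots = None  # built lazily, like per-variable RMSprop state

    def apply(self, variable_names):
        # The first call decides which variables this optimizer knows about.
        if self._slots is None:
            self._slots = {name: 0.0 for name in variable_names}
        for name in variable_names:
            if name not in self._slots:
                raise KeyError(f"The optimizer cannot recognize variable {name}.")
            self._slots[name] += 1.0


# One optimizer shared by two models: the second model's variables are unknown.
shared = ToyOptimizer()
shared.apply(["dense/kernel:0", "dense/bias:0"])
try:
    shared.apply(["dense_1/kernel:0", "dense_1/bias:0"])
    reuse_failed = False
except KeyError:
    reuse_failed = True

# One optimizer per model: the slots are built for the right variables.
fresh = ToyOptimizer()
fresh.apply(["dense_1/kernel:0", "dense_1/bias:0"])
```

If this is indeed the cause, creating a new optimizer instance for every model you compile (as in the asker's working variant) avoids the error.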

I am trying to resume training from a certain checkpoint

I'm having trouble resuming training from the last checkpoint that I saved. I'm following exactly this code, except that I changed the dataset and increased the number of epochs to 100: Machine Translation French-English notebook
What do I need to add to continue training? It won't finish in one day, and every time it restarts from epoch 1.
I found a similar question, but the answer didn't solve the problem: Resume training from a certain checkpoint.
I know this is late but I wanted to share the code of a possible solution to this.
Saving a checkpoint and restoring the model from it is pretty easy according to the TensorFlow documentation. Saving can be done with the ModelCheckpoint callback at the end of every epoch (or, with the additional save_freq argument set to an integer, every N batches):
model.compile(..., metrics=['accuracy'])
EPOCHS = 10
checkpoint_filepath = '/path/to/checkpoint'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    save_weights_only=True,
    monitor='val_accuracy',  # requires validation data to be passed to fit()
    mode='max',
    save_best_only=True  # if this is not the best epoch so far it is not saved.
)
model.fit(..., epochs=EPOCHS, callbacks=[model_checkpoint_callback])
Then, before starting a new training run or running prediction, the weights of the saved checkpoint can be loaded like this:
model.load_weights(checkpoint_filepath)
That's it.
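Loading the weights restores the model, but the epoch counter still starts again at 1 unless fit() is told where to continue; Keras exposes this through the initial_epoch argument of model.fit(). Below is a minimal sketch of finding the newest checkpoint and the epoch it belongs to. It assumes the checkpoints were written with the epoch number in the file name, e.g. ModelCheckpoint('ckpt_{epoch:03d}.h5', ...); the ckpt_ prefix, the checkpoints directory and the helper name find_latest_checkpoint are hypothetical.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: ckpt_001.h5, ckpt_002.h5, ...
CKPT_PATTERN = re.compile(r"ckpt_(\d+)\.h5$")


def find_latest_checkpoint(ckpt_dir):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    best_path, best_epoch = None, 0
    for path in Path(ckpt_dir).glob("ckpt_*.h5"):
        match = CKPT_PATTERN.search(path.name)
        if match and int(match.group(1)) > best_epoch:
            best_path, best_epoch = path, int(match.group(1))
    return best_path, best_epoch


# Usage with an already built and compiled Keras model (not executed here):
#   path, epochs_done = find_latest_checkpoint("checkpoints")
#   if path is not None:
#       model.load_weights(str(path))
#   model.fit(..., epochs=100, initial_epoch=epochs_done, callbacks=[...])
```

ModelCheckpoint numbers its files from 1, so the number in the newest file name equals the count of completed epochs and can be passed to initial_epoch directly. Note that a weights-only .h5 checkpoint does not carry the optimizer state, so the optimizer starts fresh after a restart.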

Tensorflow SavedModel ignoring Assets File on Load

I fine-tuned a BERT model from Tensorflow hub to build a simple sentiment analyzer. The model trains and runs fine. On export, I simply used:
tf.saved_model.save(model, export_dir='models')
This works just fine, until I reboot.
After a reboot, the model no longer loads. I've tried using a Keras loader as well as the Tensorflow Server, and I get the same error.
I get the following error message:
Not found: /tmp/tfhub_modules/09bd4e665682e6f03bc72fbcff7a68bf879910e/assets/vocab.txt; No such file or directory
The model is trying to load assets from the tfhub modules cache, which is wiped by reboot. I know I could persist the cache, but I don't want to do that because I want to be able to generate models and then copy them over to a separate application without worrying about the cache.
The crux of it is that I don't think it's necessary to look in the cache for the assets at all. The model was saved with an assets folder wherein vocab.txt was generated, so in order to find the assets it just needs to look in its own assets folder (I think). However, it doesn't seem to be doing that.
Is there any way to change this behaviour?
Added the code for building and exporting the model (it's not a clever model, just prototyping my workflow):
import tensorflow as tf
import tensorflow_hub as hub
# map_name_to_handle and map_model_to_preprocess are dicts (defined elsewhere, not shown)
# that map a model name to its TF Hub URL.
bert_model_name = "bert_en_uncased_L-12_H-768_A-12"
BATCH_SIZE = 64
EPOCHS = 1  # Initial
def build_bert_model(bert_model_name):
    input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name="inputs")
    preprocessing_layer = hub.KerasLayer(
        map_model_to_preprocess[bert_model_name], name="preprocessing"
    )
    encoder_inputs = preprocessing_layer(input_layer)
    bert_model = hub.KerasLayer(
        map_name_to_handle[bert_model_name], name="BERT_encoder"
    )
    outputs = bert_model(encoder_inputs)
    net = outputs["pooled_output"]
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name="classifier")(net)
    return tf.keras.Model(input_layer, net)
def main():
    train_ds, val_ds = load_sentiment140(batch_size=BATCH_SIZE, epochs=EPOCHS)
    steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()
    init_lr = 3e-5
    optimizer = tf.keras.optimizers.Adam(learning_rate=init_lr)
    model = build_bert_model(bert_model_name)
    model.compile(optimizer=optimizer, loss='mse', metrics='mse')
    model.fit(train_ds, validation_data=val_ds, steps_per_epoch=steps_per_epoch)
    tf.saved_model.save(model, export_dir='models')
This problem comes from a TensorFlow bug triggered by versions /1 and /2 of https://tfhub.dev/tensorflow/bert_en_uncased_preprocess. The updated models tensorflow/bert_*_preprocess/3 (released last Friday) avoid this bug. Please update to the newest version.
The Classify Text with BERT tutorial has been updated accordingly.
Thanks for bringing this up!

How to save model weights in tf keras when training on TPUs?

I want to save the weights of a model I trained on Kaggle's TPUs, but I'm getting an error. It's a known bug, and one workaround is to copy the weights from the TPU to the CPU before saving. Is there any other workaround for this bug?
From TensorFlow 2.3 (and tf-nightly) onward, you can pass experimental_io_device="/job:localhost" to tf.saved_model.SaveOptions and tf.saved_model.LoadOptions to save and restore a TPU model on the local file system:
model_dir = "./mnist_model"
localhost_save_option = tf.saved_model.SaveOptions(experimental_io_device="/job:localhost")
model.save(model_dir, options=localhost_save_option)
# Restore the model; load_model expects LoadOptions rather than SaveOptions
localhost_load_option = tf.saved_model.LoadOptions(experimental_io_device="/job:localhost")
model2 = tf.keras.models.load_model(model_dir, options=localhost_load_option)

Tensorflow: How can I save a checkpoint only if the error is minimized during training?

I am running a TensorFlow program and I want to store the best model for later use. I am training with an estimator (tf.contrib.tpu.TPUEstimator, which takes a run_config argument where I set save_checkpoints_secs=20*60).
estimator.train takes a train_input_fn and num_train_steps as arguments,
e.g.: estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
Instead of saving a checkpoint every n seconds, I want to store the best model, i.e. the one with minimal error during training.
Any help is welcome.
tf.estimator.BestExporter seems to be exactly what you're looking for. According to the documentation:
This class performs a model export every time when the new model is
better than any existing model.
estimator = tf.estimator.DNNClassifier(
    config=tf.estimator.RunConfig(
        model_dir='/my_model', save_summary_steps=100),
    feature_columns=[categorial_feature_a_emb, ...],
    hidden_units=[1024, 512, 256])
serving_feature_spec = tf.feature_column.make_parse_example_spec(
    categorial_feature_a_emb)
serving_input_receiver_fn = (
    tf.estimator.export.build_parsing_serving_input_receiver_fn(
        serving_feature_spec))
exporter = tf.estimator.BestExporter(
    name="best_exporter",
    serving_input_receiver_fn=serving_input_receiver_fn,
    exports_to_keep=5)
train_spec = tf.estimator.TrainSpec(...)
# train_and_evaluate expects a single EvalSpec, not a list
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=100,
    exporters=exporter,
    start_delay_secs=0,
    throttle_secs=5)
# Exporters run only as part of evaluation, so train through train_and_evaluate
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
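The rule quoted from the documentation (export only when the new model is better than any existing one) can be illustrated without TensorFlow. The following is a minimal sketch of that comparison, assuming that "better" means a lower evaluation loss; BestTracker and the sample losses are hypothetical and are not part of the Estimator API.

```python
class BestTracker:
    """Remember the lowest loss seen so far and report improvements."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_step = None

    def update(self, step, loss):
        """Return True when `loss` beats every earlier value."""
        if loss < self.best_loss:
            self.best_loss, self.best_step = loss, step
            return True
        return False


# Made-up (step, eval loss) pairs; a model would be saved only at improving steps.
evaluations = [(100, 0.9), (200, 0.7), (300, 0.8), (400, 0.5)]
tracker = BestTracker()
saved_steps = [step for step, loss in evaluations if tracker.update(step, loss)]
```

With these sample values the evaluation at step 300 is skipped because its loss is worse than the one at step 200, which is the behaviour the question asks for in place of time-based checkpointing.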