tensorflow: load checkpoint

I've been training a model which looks a bit like:
base_model = tf.keras.applications.ResNet50(weights=weights, include_top=False, input_tensor=input_tensor)
for layer in base_model.layers:
    layer.trainable = False
x = tf.keras.layers.GlobalMaxPool2D()(base_model.output)
output = tf.keras.Sequential()
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
return output(x)
I set up checkpoint saving with code like:
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
    save_freq=batch_size*5)
Yesterday I started a fit to run for 11 epochs. I'm not sure why, but the machine restarted during the 7th epoch. Naturally I want to resume fitting from the start of epoch 7.
The checkpoint code above created three files. The contents of the checkpoint file are:
model_checkpoint_path: "checkpoint"
all_model_checkpoint_paths: "checkpoint"
The other two files are binary. I tried to load the checkpoint weights with both:
model.load_weights('./2022-03-16_21-10/checkpoints/checkpoint.data-00000-of-00001')
model.load_weights('./2022-03-16_21-10/checkpoints/')
Both fail with NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files.
How can I restore this checkpoint and as a result resume fitting?
I'm using tensorflow 2.4.

These might help: Training checkpoints and tf.train.Checkpoint. According to the documentation, you should be able to load the model using something like this:
model = tf.keras.Model(...)
checkpoint = tf.train.Checkpoint(model)
# Restore the checkpointed values to the `model` object.
checkpoint.restore(save_path)
I am not sure it will work if the checkpoint contains other variables. You might have to use checkpoint.restore(path).expect_partial().
You can also check the content that has been saved, as documented under Manually inspecting checkpoints:
reader = tf.train.load_checkpoint('./tf_ckpts/')
shape_from_key = reader.get_variable_to_shape_map()
dtype_from_key = reader.get_variable_to_dtype_map()
sorted(shape_from_key.keys())
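For the NotFoundError in the original question specifically: load_weights expects the checkpoint prefix (the common stem of the .index and .data-00000-of-00001 files), not the data shard itself. A minimal sketch of resuming, assuming the callback wrote TF-format checkpoints into ./2022-03-16_21-10/checkpoints/ and with train_ds as a hypothetical placeholder for your training data:
ckpt_dir = './2022-03-16_21-10/checkpoints'
# Resolve the newest prefix recorded in the `checkpoint` bookkeeping file
latest = tf.train.latest_checkpoint(ckpt_dir)
model.load_weights(latest)
# Resume fitting; initial_epoch is zero-based, so restarting "from the
# start of epoch 7" means initial_epoch=6 (train_ds is hypothetical)
model.fit(train_ds, epochs=11, initial_epoch=6, callbacks=[cp_callback])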

Related

How can I use Tensorflow.Checkpoint to recover a previously trained net

I'm trying to understand how to recover a saved/checkpointed net using tensorflow.train.Checkpoint.restore.
I'm using code that's strongly based on Google's Colab tutorial for creating a pix2pix GAN. Below, I've excerpted the key portion, which just attempts to instantiate a new net, then to fill it with weights from a previous net that was saved and checkpointed.
I'm assigning a unique(ish) id number to a particular instantiation of a net by summing all the weights of the net. I compare these id numbers both at the creation of the net and after I've attempted to recover the checkpointed net.
def main(opt):
    # Initialize pix2pix GAN using arguments input from command line
    p2p = Pix2Pix(vars(opt))
    print(opt)

    # Print sum of initial weights for net
    print("Init Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))

    # Create or read from model checkpoints
    checkpoint = tf.train.Checkpoint(generator_optimizer=p2p.generator_optimizer,
                                     discriminator_optimizer=p2p.discriminator_optimizer,
                                     generator=p2p.generator,
                                     discriminator=p2p.discriminator)

    # Print sum of weights from checkpoint, to ensure it has access
    # to relevant regions of p2p
    print("Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))

    # Recover checkpointed net
    checkpoint.restore(tf.train.latest_checkpoint(opt.weights)).expect_partial()

    # Print sum of weights for p2p & checkpoint after attempting to restore saved net
    print("Restore Model Weights:",
          sum([x.numpy().sum() for x in p2p.generator.weights]))
    print("Restored Checkpoint Weights:",
          sum([x.numpy().sum() for x in checkpoint.generator.weights]))
    print("Done.")

if __name__ == '__main__':
    opt = parse_opt()
    main(opt)
The output I got when I ran this code was as follows:
Namespace(channels='1', data='data', img_size=256, output='output', weights='weights/ckpt-40.data-00000-of-00001')
## These are the input arguments, the images have only 1 channel (they're gray scale)
## The directory with data is ./data, the images are 256x256
## The output directory is ./output
## The checkpointed net is stored in ./weights/ckpt-40.data-00000-of-00001
## Sums of nets' weights
Init Model Weights: 11047.206374436617
Checkpoint Weights: 11047.206374436617
Restore Model Weights: 11047.206374436617
Restored Checkpoint Weights: 11047.206374436617
Done.
There is no change in the sum of the net's weights before and after recovering the checkpointed version, although p2p and checkpoint do seem to have access to the same locations in memory.
Why am I not recovering the saved net?
The problem arose because tf.train.latest_checkpoint needs the directory in which the checkpointed net is stored, not a specific file (or what I took to be the specific file - ./weights/ckpt-40.data-00000-of-00001).
When it is not given a valid directory it returns None, and checkpoint.restore(None) silently proceeds to the next line of code without updating the net or throwing an error. The fix was to pass the directory containing the relevant checkpoint files, rather than just the file I believed to be relevant.
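A minimal sketch of the corrected call, assuming the checkpoint files live under ./weights/ as in the question:
# Pass the checkpoint *directory*; latest_checkpoint resolves the newest
# prefix (e.g. ./weights/ckpt-40) or returns None if none is found
latest = tf.train.latest_checkpoint('./weights')
assert latest is not None, 'no checkpoint found; restore would silently no-op'
checkpoint.restore(latest).expect_partial()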
My alternative approach is to use a callback together with restore; you can also name the layers that the checkpoint tracks.
Example:
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: DataSet
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
DATA = adding_array_DATA(DATA, action, reward, gamescores, step)
dataset = tf.data.Dataset.from_tensor_slices((tf.constant(DATA, dtype=tf.float32),tf.constant(np.reshape(0, (1, 1, 1, 1)))))
batched_features = dataset
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Initialize
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model = tf.keras.models.Sequential([
tf.keras.layers.InputLayer(input_shape=(1200, 1)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True, return_state=False)),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
])
model.add(layers.Flatten())
model.add(layers.Dense(64))
model.add(layers.Dense(2))
model.summary()
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Callback
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
cp_callback = tf.keras.callbacks.ModelCheckpoint(checkpoint_dir, monitor='val_loss',
verbose=0, save_best_only=True, mode='min' )
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Optimizer
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
optimizer = tf.keras.optimizers.Nadam(
learning_rate=0.0001, beta_1=0.9, beta_2=0.999, epsilon=1e-07,
name='Nadam'
) # 0.00001
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Loss Fn
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
# 1
lossfn = tf.keras.losses.MeanSquaredLogarithmicError(reduction=tf.keras.losses.Reduction.AUTO, name='mean_squared_logarithmic_error')
# 2
# lossfn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Model Summary
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
model.compile(optimizer=optimizer, loss=lossfn, metrics=['accuracy'])
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
: Training
"""""""""""""""""""""""""""""""""""""""""""""""""""""""""
history = model.fit(batched_features, epochs=1 ,validation_data=(batched_features), callbacks=[cp_callback]) # epochs=500 # , callbacks=[cp_callback, tb_callback]
checkpoint = tf.train.Checkpoint(model)
checkpoint.restore(checkpoint_dir)
input('...')
Output:
2022-03-08 10:33:06.965274: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8100
1/1 [==============================] - ETA: 0s - loss: 0.0154 - accuracy: 0.0000e+00
2022-03-08 10:33:16.175845: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
1/1 [==============================] - 31s 31s/step - loss: 0.0154 - accuracy: 0.0000e+00 - val_loss: 0.0074 - val_accuracy: 0.0000e+00
...
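One caveat with the snippet above: tf.train.Checkpoint.restore expects a checkpoint prefix rather than a bare directory, so a more defensive sketch of the restore step (assuming checkpoint_dir contains TF-format checkpoint files) would be:
latest = tf.train.latest_checkpoint(checkpoint_dir)  # newest prefix, or None
if latest is not None:
    checkpoint.restore(latest)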

What parameters can be used in filepath for ModelCheckpoints (tf.Keras)?

I'm trying to train a Keras model and save the model weights at every epoch and batch.
I define a checkpoint as follows:
checkpoint_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}'
checkpoint = ModelCheckpoint(filepath=checkpoint_path, save_freq=5000)
and train the model:
model.fit(x=x_train, y=y_train, epochs=3, validation_data=(x_test, y_test),
          batch_size=10, callbacks=[checkpoint])
But right after the first iteration this error occurs:
KeyError: 'Failed to format this callback filepath: "model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}". Reason: \'batch\'
How can I have Python add the batch number to the file name?
Where can I find the list of other parameters that are available for use in the file path?
My setup: Windows 10, jupyter notebook in chrome, Python 3.5.4, Tensorflow 2.3.0, Keras is imported from Tensorflow.
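For reference, a sketch of placeholders that do format successfully: in TF 2.3 the filepath is formatted with epoch plus the keys of the logs dict (the compiled loss and metrics), and batch is not among them, which is exactly what the KeyError above is complaining about. The val_loss placeholder below assumes the model is compiled and fit with validation data:
checkpoint_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{val_loss:.4f}'
checkpoint = ModelCheckpoint(filepath=checkpoint_path, save_freq=5000)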

Import Weights from Keras Classifier into TF Object Detection API

I have a classifier that I trained using keras that is working really well. It uses keras.applications.MobileNetV2.
This classifier is well trained on around 200 categories, and has a high accuracy.
However, I would like to use the feature extraction layers from this classifier as part of an object detection model.
I have been using the Tensorflow Object Detection API, and looking into the SSDLite+MobileNetV2 model. I can start to run training, but the training is very slow and the bulk of the loss comes from the classification stage.
What I would like to do is assign the weights from my keras .h5 model to the Feature Extraction layer of MobileNetV2 in Tensorflow, but I'm not sure of the best way to do that.
I can load the h5 file easily, and get a list of layer names:
import keras
keras_model = keras.models.load_model("my_classifier.h5")
keras_names = [l.name for l in keras_model.layers]
print(keras_names)
I can also restore the tensorflow checkpoint from the object detection API and export the layers with weights:
tf.reset_default_graph()
with tf.Session() as sess:
    new_saver = tf.train.import_meta_graph('models/model.ckpt.meta')
    what = new_saver.restore(sess, 'models/model.ckpt')
    tf_names = []
    for op in sess.graph.get_operations():
        if "MobilenetV2" in op.name and "Assign" in op.name:
            tf_names.append(op.name)
    print(tf_names)
I cannot seem to get a good match-up between the layer names from Keras and from TensorFlow, and even if I could, I'm not sure of the next steps.
If anyone could give me some advice about the best way to approach this I would be very grateful.
Update:
I followed Sharky's suggestion below, with a slight modification:
new_saver = tf.train.import_meta_graph(os.path.join(keras_checkpoint_dir, 'keras_model.ckpt.meta'))
new_saver.restore(sess, os.path.join(keras_checkpoint_dir, tf.train.latest_checkpoint(keras_checkpoint_dir)))
However unfortunately I now get this error:
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Key FeatureExtractor/MobilenetV2/expanded_conv_6/project/BatchNorm/gamma not found in checkpoint
[[node save/RestoreV2_295 (defined at :7) = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2_295/tensor_names, save/RestoreV2_295/shape_and_slices)]]
[[{{node save/RestoreV2_196/_393}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_789_save/RestoreV2_196", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
Any ideas on how to get rid of this error?
You can use tf.keras.estimator.model_to_estimator
estimator = tf.keras.estimator.model_to_estimator(keras_model=model, model_dir=path)
saver = tf.train.Saver()
with tf.Session() as sess:
    keras_dir = os.path.join(path, 'keras')  # model_to_estimator writes the keras checkpoint here
    saver.restore(sess, tf.train.latest_checkpoint(keras_dir))
    print(tf.global_variables())
This should do the job. Note that it will create a subdirectory inside originally specified path.
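To verify whether the converted checkpoint actually contains the variable names the object-detection graph expects (the NotFoundError above is a pure name mismatch), it may help to list the checkpoint's variables; a sketch, with keras_dir being the subdirectory mentioned above:
# Each entry is (variable_name, shape); compare these names against the
# FeatureExtractor/MobilenetV2/... keys the detection graph asks for
for name, shape in tf.train.list_variables(tf.train.latest_checkpoint(keras_dir)):
    print(name, shape)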

Error when implementing tensorflow high level api

I am trying to use TensorFlow's provided high-level APIs, specifically the BaselineClassifier. However, when trying to train the model, I get the following error:
NotFoundError (see above for traceback): Key baseline/bias not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Code:
import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def digit_cross():
    # Number of classes, one class for each of 10 digits.
    num_classes = 10
    digit = datasets.load_digits()
    x = digit.data
    y = digit.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3, random_state=42)
    y_train_index = np.arange(y_train.size)
    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(x_train)},
        y=np.array(y_train),
        num_epochs=None,
        shuffle=False)
    # Build BaselineClassifier
    classifier = tf.estimator.BaselineClassifier(n_classes=num_classes,
                                                 model_dir="./checkpoints_tutorial17-1/")
    # Fit model.
    classifier.train(train_input_fn)

digit_cross()
It seems that you have a checkpoint in model_dir="./checkpoints_tutorial17-1/", which is from another model and is not from a BaselineClassifier. To be specific, you have a checkpoint file and model.ckpt-* files in that folder.
As the TensorFlow documentation states:
model_dir: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. If PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be same. If both are None, a temporary directory will be used.
Here, BaselineClassifier will first build a graph which uses baseline/bias. Then it finds that there is a previous checkpoint in model_dir. It will try to load this checkpoint, and you should see an info message (if you've called tf.logging.set_verbosity(tf.logging.INFO)) saying something like
"INFO:tensorflow:Restoring parameters from .../checkpoints_tutorial17-1\model.ckpt-..."
Because this checkpoint in model_dir is not from a BaselineClassifier, it won't have baseline/bias. BaselineClassifier cannot find it and will thus throw an error.
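A minimal sketch of the fix, assuming the stale checkpoint is disposable (the fresh directory name below is made up): either delete the old checkpoint files or point model_dir at a directory of its own.
import shutil

# Remove the incompatible checkpoint left behind by the earlier model...
shutil.rmtree("./checkpoints_tutorial17-1/", ignore_errors=True)

# ...or simply train into a fresh directory
classifier = tf.estimator.BaselineClassifier(n_classes=num_classes,
                                             model_dir="./checkpoints_baseline/")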

Keras: Load checkpoint weights HDF5 generated by multiple GPUs

Checkpoint snippet:
checkpointer = ModelCheckpoint(filepath=os.path.join(savedir, "mid/weights.{epoch:02d}.hd5"),
                               monitor='val_loss', verbose=1,
                               save_best_only=False, save_weights_only=False)
hist = model.fit_generator(
    gen.generate(batch_size=batch_size, nb_classes=nb_classes),
    samples_per_epoch=593920, nb_epoch=nb_epoch, verbose=1,
    callbacks=[checkpointer],
    validation_data=gen.vld_generate(VLD_PATH, batch_size=64, nb_classes=nb_classes),
    nb_val_samples=10000
)
I trained my model on a multi-GPU host, which dumps the intermediate "mid" files in HDF5 format. When I loaded them on a single-GPU machine with model.load_weights('mid'), an error was raised:
Using TensorFlow backend.
Traceback (most recent call last):
  File "server.py", line 171, in <module>
    model = load_model_and_weights('zhch.yml', '7_weights.52.hd5')
  File "server.py", line 16, in load_model_and_weights
    model.load_weights(os.path.join('model', weights_name))
  File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2701, in load_weights
    self.load_weights_from_hdf5_group(f)
  File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2753, in load_weights_from_hdf5_group
    str(len(flattened_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 1 layers into a model with 21 layers.
Is there any way to load checkpoint weights generated by multiple GPUs on a single GPU machine? It seems that no issue of Keras discussed this problem thus any help would be appreciated.
You can load your model on a single GPU like this:
from keras.models import load_model
multi_gpus_model = load_model('mid')
origin_model = multi_gpus_model.layers[-2]  # use multi_gpus_model.summary() to see the layers; the original model is nested as the second-to-last layer
origin_model.save_weights('single_gpu_model.hdf5')
'single_gpu_model.hdf5' is the file that you can load to the single GPU machine model.
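Then, on the single-GPU machine, the loading side is just a sketch like this (build_model is a hypothetical stand-in for whatever constructs the original architecture):
single_model = build_model()  # hypothetical: rebuild the same architecture as the original net
single_model.load_weights('single_gpu_model.hdf5')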
Try this function:
def keras_model_reassign_weights(model_cpu, model_gpu):
    weights_temp = {}
    print('_' * 5, 'Collecting weights from GPU model', '_' * 5)
    for layer in model_gpu.layers:
        try:
            # The original model is nested inside the multi-GPU wrapper;
            # collect the weights of each of its inner layers by name.
            for layer_unw in layer.layers:
                # print('Weights extracted for: ', layer_unw.name)
                weights_temp[layer_unw.name] = layer_unw.get_weights()
            break
        except:
            print('Skipped: ', layer.name)
    print('_' * 5, 'Writing weights to CPU model', '_' * 5)
    for layer in model_cpu.layers:
        try:
            layer.set_weights(weights_temp[layer.name])
            # print(layer.name, 'Done!')
        except:
            print(layer.name, 'weights were not set for this layer!')
    return model_cpu
But you need to load the weights into your GPU model first:
#load or initialize your keras multi-gpu model
model_gpu = None
#load or initialize your keras model with the same structure, without using keras.multi_gpu function
model_cpu = None
#load weights into multigpu model
model_gpu.load_weights(r'gpu_model_best_checkpoint.hdf5')
#execute function
model_cpu = keras_model_reassign_weights(model_cpu,model_gpu)
#save obtained weights for cpu model
model_cpu.save_weights(r'CPU_model.hdf5')
After transferring, you can use the weights with a single-GPU or CPU model.
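As a lighter-weight alternative sketch: when the per-layer names match between the two models, recent Keras versions can also map HDF5 weights by layer name directly, which sidesteps the wrapper mismatch in some cases (whether this works depends on how the multi-GPU model was constructed):
# skip_mismatch requires by_name=True; layers whose names or shapes
# don't line up keep their initialized values
model_cpu.load_weights('gpu_model_best_checkpoint.hdf5', by_name=True, skip_mismatch=True)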