I'm trying to train a Keras model and save the model weights at every epoch and batch.
I define a checkpoint as follows:
checkpoint_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}'
checkpoint = ModelCheckpoint(filepath=checkpoint_path, save_freq=5000)
and train the model:
model.fit(x=x_train, y=y_train, epochs=3, validation_data=(x_test, y_test),
          batch_size=10, callbacks=[checkpoint])
But right after the first iteration, the following error occurs:
KeyError: 'Failed to format this callback filepath: "model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}". Reason: \'batch\''
How can I have Python add the batch number to the file name?
Where can I find the list of other parameters that are available for use in the filepath template?
My setup: Windows 10, Jupyter Notebook in Chrome, Python 3.5.4, TensorFlow 2.3.0, with Keras imported from TensorFlow.
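In TF 2.3, ModelCheckpoint formats the filepath with only epoch plus whatever keys appear in logs (the loss and metrics), which is why batch raises a KeyError. A minimal sketch of one possible workaround using a custom callback (the BatchCheckpoint name and every_n_batches parameter are hypothetical, not a Keras API):

import tensorflow as tf

class BatchCheckpoint(tf.keras.callbacks.Callback):
    # Hypothetical helper: saves weights every N batches, substituting
    # the current epoch and batch number into the file name template.
    def __init__(self, path_template, every_n_batches=5000):
        super().__init__()
        self.path_template = path_template
        self.every_n_batches = every_n_batches
        self.epoch = 0

    def on_epoch_begin(self, epoch, logs=None):
        self.epoch = epoch

    def on_train_batch_end(self, batch, logs=None):
        if batch > 0 and batch % self.every_n_batches == 0:
            self.model.save_weights(
                self.path_template.format(epoch=self.epoch, batch=batch))

checkpoint = BatchCheckpoint('model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}')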
Related
I'm using the following code to load an ImageNet pre-trained VGG19 model and fit it to my custom dataset.
import tensorflow as tf
from tensorflow import keras
from keras.applications.vgg19 import VGG19

optim = tf.keras.optimizers.RMSprop(momentum=0.9)
vgg19 = VGG19(include_top=False, weights='imagenet',
              input_tensor=tf.keras.layers.Input(shape=(224, 224, 3)))
vgg19.trainable = False
# x = keras.layers.GlobalAveragePooling2D()(model_vgg19_pt.output)
x = keras.layers.Flatten()(vgg19.output)
output = keras.layers.Dense(n_classes, activation='softmax')(x)
model_vgg19_pt = keras.models.Model(inputs=[vgg19.input], outputs=[output])
model_vgg19_pt.compile(optimizer=optim,
                       loss='categorical_crossentropy',
                       metrics=['categorical_accuracy'])

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)
model_vgg19_pt.fit(x_train, y_train, batch_size=20,
                   epochs=50, callbacks=[callback])
On the model.fit() line, I get the following error:
KeyError: 'The optimizer cannot recognize variable dense_1/kernel:0. This usually means you are trying to call the optimizer to update different parts of the model separately. Please call `optimizer.build(variables)` with the full list of trainable variables before the training loop or use legacy optimizer `tf.keras.optimizers.legacy.{self.__class__.__name__}`.'
What does it mean and how can I fix it?
I get the same error for
keras.applications.inception_v3
too, when using the same implementation method.
Additionally, this was working in a Jupyter notebook with CPU TensorFlow, but when running on a remote machine with tensorflow-gpu installed, I get these errors.
This works fine with the SGD optimizer, but not with RMSprop. Why?
Additional
Using this:
model_vgg19_pt.compile(optimizer=tf.keras.optimizers.RMSprop(momentum=0.9),
                       loss='categorical_crossentropy',
                       metrics=['categorical_accuracy'])
instead of the version above works. But can somebody explain why?
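One plausible explanation (an assumption on my part, not confirmed in this thread): in recent TensorFlow versions an optimizer instance builds its slot variables for a specific set of model variables the first time it is used, so if the optim object was created and used against an earlier build of the model in the notebook session, it no longer matches the new model's variables; constructing a fresh RMSprop inside compile() always starts unbuilt and avoids the mismatch.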
Which version of TensorFlow GPU have you installed? TensorFlow 2.10 was the last TensorFlow release that supported GPU on native Windows. Please check the linked install guide and follow all the hardware/software requirements for GPU support.
Also, the scheduler function you pass to LearningRateScheduler in your callback is not defined anywhere in the code you posted.
I was able to train the model after removing the callback from model.fit(). (Attaching the gist here for your reference)
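For reference, LearningRateScheduler expects a function that takes the epoch index (and optionally the current learning rate) and returns the new learning rate. A minimal sketch, adapted from the pattern in the TensorFlow documentation rather than from the original post:

import tensorflow as tf

def scheduler(epoch, lr):
    # keep the initial learning rate for the first 10 epochs,
    # then decay it exponentially
    if epoch < 10:
        return lr
    return lr * tf.math.exp(-0.1)

callback = tf.keras.callbacks.LearningRateScheduler(scheduler)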
I've been training a model which looks a bit like:
base_model = tf.keras.applications.ResNet50(weights=weights, include_top=False, input_tensor=input_tensor)
for layer in base_model.layers:
    layer.trainable = False
x = tf.keras.layers.GlobalMaxPool2D()(base_model.output)
output = tf.keras.Sequential()
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
output.add(tf.keras.layers.Dense(2, activation='linear'))
return output(x)
I set up checkpoint saving with code like:
cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_path,
    verbose=1,
    save_weights_only=True,
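    # note: an integer save_freq is measured in batches seen, not samples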
    save_freq=batch_size*5)
Yesterday I started a fit to run for 11 epochs. I'm not sure why, but the machine restarted during the 7th epoch. Naturally I want to resume fitting from the start of epoch 7.
The checkpoint code above created three files: checkpoint, checkpoint.data-00000-of-00001, and checkpoint.index.
The contents of checkpoint are:
model_checkpoint_path: "checkpoint"
all_model_checkpoint_paths: "checkpoint"
The other two files are binary. I tried to load the checkpoint weights with both:
model.load_weights('./2022-03-16_21-10/checkpoints/checkpoint.data-00000-of-00001')
model.load_weights('./2022-03-16_21-10/checkpoints/')
Both fail with NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files.
How can I restore this checkpoint and as a result resume fitting?
I'm using tensorflow 2.4.
These might help: Training checkpoints and tf.train.Checkpoint. According to the documentation, you should be able to load the model using something like this:
model = tf.keras.Model(...)
checkpoint = tf.train.Checkpoint(model)
# Restore the checkpointed values to the `model` object.
checkpoint.restore(save_path)
I am not sure it will work if the checkpoint contains other variables. You might have to use checkpoint.restore(path).expect_partial().
You can also check the content that has been saved (according to the documentation) via Manually inspecting checkpoints:
reader = tf.train.load_checkpoint('./tf_ckpts/')
shape_from_key = reader.get_variable_to_shape_map()
dtype_from_key = reader.get_variable_to_dtype_map()
sorted(shape_from_key.keys())
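If the weights restore cleanly, training can then be resumed from the start of epoch 7 by passing initial_epoch to fit(). A sketch under the question's setup (x_train and y_train stand in for the original training data; note that load_weights wants the checkpoint prefix, not the .data file):

# load_weights expects the checkpoint *prefix*, without the
# .data-00000-of-00001 or .index suffix
model.load_weights('./2022-03-16_21-10/checkpoints/checkpoint')

# initial_epoch is zero-based, so 6 resumes at the start of epoch 7
model.fit(x_train, y_train, epochs=11, initial_epoch=6, callbacks=[cp_callback])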
I'm trying to use a model from tensorflow hub on Kaggle.
Like so:
import tensorflow as tf
import tensorflow_hub as hub

m = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
                   output_shape=[1280],
                   trainable=False),  # Can be True, see below.
    tf.keras.layers.Dense(num_classes, activation='softmax')
])
m.build([None, 224, 224, 3])  # Batch input shape.
It works well with a GPU, but as soon as I switch to a TPU with TFRecords I get the following error:
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/87fb99f72aec02d017e12c0a3d86c5c182ec22ca/variables/variables')
However, the setup and the TFRecords dataset are correct, since everything works when I swap the pretrained hub model for the Keras application of the same model (i.e., for the example above, using the MobileNet Keras application).
I tried caching but have been unsuccessful; is there anything I need to be aware of when following this guide:
https://www.tensorflow.org/hub/caching
Thanks in advance!
The failure happens because the TPU is trying to load the TFHub model from /tmp/, which it doesn't have access to. You should be able to get this to work with:
with strategy.scope():
    load_locally = tf.saved_model.LoadOptions(experimental_io_device='/job:localhost')
    m = tf.keras.Sequential([
        hub.KerasLayer(
            "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
            output_shape=[1280],
            load_options=load_locally,
            trainable=False),  # Can be True, see below.
        tf.keras.layers.Dense(num_classes, activation='softmax')
    ])
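The experimental_io_device='/job:localhost' load option redirects the SavedModel file reads to the local host, so the hub module cached under /tmp/ is read by the coordinator instead of the TPU workers, which cannot see the local filesystem.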
Source: EfficientNetB7 on 100+ flowers.
Created a simple dummy sequential model in tf.keras as shown below:
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential()
model.add(layers.Dense(10, input_shape=(100, 100)))
model.add(layers.Conv1D(3, 2))
model.add(layers.Flatten())
model.add(layers.Dense(10, activation='softmax', name='predict_10'))
Trained the model and saved it using tf.keras.models.save_model.
To get the input and output node names, I used saved_model_cli:
saved_model_cli show --dir "path/to/SavedModel" --all
Froze the SavedModel with the freeze_graph.py utility:
python freeze_graph.py --input_saved_model_dir=<path/to/SavedModel> --output_graph=<path/freeze.pb> --input_binary=True --output_node_names=StatefulPartitionedCall
The model is frozen. Now here's the main issue:
To load the frozen graph, I followed this guide: Migrate tf1.x to tf2.x (wrap_frozen_graph). I used:
with tf.io.gfile.GFile("path/to/freeze.pb", 'rb') as f:
    graph_def = tf.compat.v1.GraphDef()
    graph_def.ParseFromString(f.read())

load_frozen = wrap_frozen_graph(graph_def, inputs='dense_3_input:0', outputs='predict_10:0')
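For completeness, wrap_frozen_graph is not defined in the snippet above; the helper from the linked migration guide looks roughly like this:

import tensorflow as tf

def wrap_frozen_graph(graph_def, inputs, outputs):
    # import the GraphDef into a new tf.function-backed graph
    def _imports_graph_def():
        tf.compat.v1.import_graph_def(graph_def, name="")
    wrapped_import = tf.compat.v1.wrap_function(_imports_graph_def, [])
    import_graph = wrapped_import.graph
    # prune the graph down to a callable with the given input/output tensors
    return wrapped_import.prune(
        tf.nest.map_structure(import_graph.as_graph_element, inputs),
        tf.nest.map_structure(import_graph.as_graph_element, outputs))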
Output error
ValueError: Input 1 of node StatefulPartitionedCall was passed float from dense_3/kernel:0 incompatible with expected resource.
I get the same error when converting the .pb to .dlc (Qualcomm). Ultimately, I want to run the original model on Qualcomm's Hexagon DSP or GPU.
Checkpoint snippet:
checkpointer = ModelCheckpoint(
    filepath=os.path.join(savedir, "mid/weights.{epoch:02d}.hd5"),
    monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False)
hist = model.fit_generator(
    gen.generate(batch_size=batch_size, nb_classes=nb_classes),
    samples_per_epoch=593920, nb_epoch=nb_epoch, verbose=1,
    callbacks=[checkpointer],
    validation_data=gen.vld_generate(VLD_PATH, batch_size=64, nb_classes=nb_classes),
    nb_val_samples=10000)
I trained my model on a multi-GPU host, which dumps the mid weight files in HDF5 format. When I loaded them on a single-GPU machine with model.load_weights('mid'), an error was raised:
Using TensorFlow backend.
Traceback (most recent call last):
File "server.py", line 171, in <module>
model = load_model_and_weights('zhch.yml', '7_weights.52.hd5')
File "server.py", line 16, in load_model_and_weights
model.load_weights(os.path.join('model', weights_name))
File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2701, in load_weights
self.load_weights_from_hdf5_group(f)
File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2753, in load_weights_from_hdf5_group
str(len(flattened_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 1 layers into a model with 21 layers.
Is there any way to load checkpoint weights generated by multiple GPUs on a single-GPU machine? No Keras issue seems to discuss this problem, so any help would be appreciated.
You can load your model on a single GPU like this:
from keras.models import load_model
multi_gpus_model = load_model('mid')
origin_model = multi_gpus_model.layers[-2] # you can use multi_gpus_model.summary() to see the layer of the original model
origin_model.save_weights('single_gpu_model.hdf5')
'single_gpu_model.hdf5' is the file you can load into the model on the single-GPU machine.
Try this function:
def keras_model_reassign_weights(model_cpu, model_gpu):
    weights_temp = {}
    print('_'*5, 'Collecting weights from GPU model', '_'*5)
    for layer in model_gpu.layers:
        try:
            # the original model is nested as a single layer inside the
            # multi-GPU wrapper; collect the weights of its sub-layers
            for layer_unw in layer.layers:
                #print('Weights extracted for: ', layer_unw.name)
                weights_temp[layer_unw.name] = layer_unw.get_weights()
            break
        except:
            print('Skipped: ', layer.name)
    print('_'*5, 'Writing weights to CPU model', '_'*5)
    for layer in model_cpu.layers:
        try:
            layer.set_weights(weights_temp[layer.name])
            #print(layer.name, 'Done!')
        except:
            print(layer.name, 'weights were not set for this layer!')
    return model_cpu
But you need to load the weights into your GPU model first:
#load or initialize your keras multi-gpu model
model_gpu = None
#load or initialize your keras model with the same structure, without using keras.multi_gpu function
model_cpu = None
#load weights into multigpu model
model_gpu.load_weights(r'gpu_model_best_checkpoint.hdf5')
#execute function
model_cpu = keras_model_reassign_weights(model_cpu,model_gpu)
#save obtained weights for cpu model
model_cpu.save_weights(r'CPU_model.hdf5')
After transferring, you can use the weights with a single-GPU or CPU model.