I've noticed that the new Estimator API automatically saves checkpoints during training and automatically restarts from the last checkpoint when training is interrupted. Unfortunately, it seems to keep only the last 5 checkpoints.
Do you know how to control the number of checkpoints that are kept during training?
TensorFlow's tf.estimator.Estimator takes config as an optional argument, which can be a tf.estimator.RunConfig object used to configure runtime settings. You can achieve this as follows:
# Change the maximum number of checkpoints to 25
run_config = tf.estimator.RunConfig()
run_config = run_config.replace(keep_checkpoint_max=25)

# Build your estimator
estimator = tf.estimator.Estimator(model_fn,
                                   model_dir=job_dir,
                                   config=run_config,
                                   params=None)
The config parameter is available in all classes (DNNClassifier, DNNLinearCombinedClassifier, LinearClassifier, etc.) that extend tf.estimator.Estimator.
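For instance, a minimal sketch of passing the same run_config to a canned estimator (my_feature_columns here is a placeholder for feature columns defined elsewhere):

# The same RunConfig works for canned estimators such as DNNClassifier
classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,  # placeholder, assumed defined elsewhere
    hidden_units=[10, 10],
    model_dir=job_dir,
    config=run_config)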
As a side note, I would like to add that in TensorFlow 2 the situation is a little simpler. To keep a certain number of checkpoint files when using the Object Detection API, you can modify the model_main_tf2.py source code. First, add and define an integer flag:
# Keep the last 25 checkpoints
flags.DEFINE_integer('checkpoint_max_to_keep', 25,
                     'Integer defining how many checkpoint files to keep.')
Then use this pre-defined value in a call to model_lib_v2.train_loop:
# Ensure the training loop keeps the last 25 checkpoints
model_lib_v2.train_loop(...,
                        checkpoint_max_to_keep=FLAGS.checkpoint_max_to_keep,
                        ...)
The symbol ... above denotes other options to model_lib_v2.train_loop.
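Putting the two changes together, a minimal sketch of how the relevant part of model_main_tf2.py could look after the edit (the remaining setup of the script is unchanged and omitted here):

# Keep the last 25 checkpoints (added flag)
flags.DEFINE_integer('checkpoint_max_to_keep', 25,
                     'Integer defining how many checkpoint files to keep.')

def main(unused_argv):
    # ... original flag handling and strategy selection ...
    with strategy.scope():
        model_lib_v2.train_loop(
            pipeline_config_path=FLAGS.pipeline_config_path,
            model_dir=FLAGS.model_dir,
            train_steps=FLAGS.num_train_steps,
            use_tpu=FLAGS.use_tpu,
            checkpoint_every_n=FLAGS.checkpoint_every_n,
            checkpoint_max_to_keep=FLAGS.checkpoint_max_to_keep,
            record_summaries=FLAGS.record_summaries)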
I'm using a pre-trained model from the TensorFlow 2 Detection Model Zoo for object detection, in Colab (TensorFlow v2.7.0).
The (new) dataset consists of 255 images for training. The train_config > batch_size in pipeline.config is 8, so I intend to make a checkpoint every epoch (thus checkpoint_every_n: 255/8 ≈ 32) and train for 100 epochs; thus num_train_steps is 3200. As a result, I assume there will be 100 checkpoint files generated.
!python model_main_tf2.py \
    --pipeline_config_path="./models/pipeline.config" \
    --model_dir="./models" \
    --checkpoint_every_n=32 \
    --num_train_steps=3200 \
    --alsologtostderr
However, there are only 7 checkpoint files after training. Here is a snapshot from the tree /F tool on the Windows command line.
Did I miss something (e.g. an additional configuration somewhere)? Is my above assumption correct? Or is this simply a bug?
In the file model_main_tf2.py, the main loop is:
with strategy.scope():
    model_lib_v2.train_loop(
        pipeline_config_path=FLAGS.pipeline_config_path,
        model_dir=FLAGS.model_dir,
        train_steps=FLAGS.num_train_steps,
        use_tpu=FLAGS.use_tpu,
        checkpoint_every_n=FLAGS.checkpoint_every_n,
        record_summaries=FLAGS.record_summaries)
Checking model_lib_v2.train_loop() (link), there is a default argument:
def train_loop(
    pipeline_config_path,
    model_dir,
    config_override=None,
    train_steps=None,
    use_tpu=False,
    save_final_config=False,
    checkpoint_every_n=1000,
    checkpoint_max_to_keep=7,  # Here!
    record_summaries=True,
That is why only 7 checkpoint files are generated; they should be the last ones.
In addition, the argument checkpoint_every_n is not strictly respected. It is affected by NUM_STEPS_PER_ITERATION, which is hardcoded as 100 (it can't be changed from outside). In the file model_lib_v2.py, function train_loop(), the relevant lines are:
if ((int(global_step.value()) - checkpointed_step) >=
        checkpoint_every_n):
    manager.save()  # Here!
    checkpointed_step = int(global_step.value())
...meaning: each iteration advances 100 steps, and num_train_steps was set to 3200, so a checkpoint is made every 100 steps, ending up with 32 files. Plus one checkpoint file at the very beginning (line), we end up with ckpt-33 as shown.
I cannot load model weights after saving them in TensorFlow 2.2. The weights appear to be saved correctly (I think); however, I fail to load the pre-trained model.
My current code is:
segmentor = sequential_model_1()
discriminator = sequential_model_2()
def save_model(ckp_dir):
    # create directory, if it does not exist:
    utils.safe_mkdir(ckp_dir)

    # save weights
    segmentor.save_weights(os.path.join(ckp_dir, 'checkpoint-segmentor'))
    discriminator.save_weights(os.path.join(ckp_dir, 'checkpoint-discriminator'))

def load_pretrained_model(ckp_dir):
    try:
        segmentor.load_weights(os.path.join(ckp_dir, 'checkpoint-segmentor'), skip_mismatch=True)
        discriminator.load_weights(os.path.join(ckp_dir, 'checkpoint-discriminator'), skip_mismatch=True)
        print('Loading pre-trained model from: {0}'.format(ckp_dir))
    except ValueError:
        print('No pre-trained model available.')
Then I have the training loop:
# training loop:
for epoch in range(num_epochs):
    for image, label in dataset:
        train_step()

    # save the best model found during training:
    if this_is_the_best_model_on_validation_set():
        save_model(ckp_dir='logs_dir')
And then, at the end of the training "for loop", I want to load the best model and do a test with it. Hence, I run:
# load saved model and do a test:
load_pretrained_model(ckp_dir='logs_dir')
test()
However, this results in a ValueError. I checked the directory where the weights should be saved, and there they are!
Any idea what is wrong with my code? Am I loading the weights incorrectly?
Thank you!
OK, here is your problem: the try-except block you have is obscuring the real issue. Removing it gives the ValueError:
ValueError: When calling model.load_weights, skip_mismatch can only be set to True when by_name is True.
There are two ways to mitigate this: you can either call load_weights with by_name=True, or remove skip_mismatch=True, depending on your needs. Either works for me when testing your code.
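For example, a minimal sketch of the second option (dropping skip_mismatch), which works with the TF-format checkpoints written by save_weights above:

def load_pretrained_model(ckp_dir):
    # Without skip_mismatch, load_weights restores the TF-format
    # checkpoints written by save_weights in the question.
    segmentor.load_weights(os.path.join(ckp_dir, 'checkpoint-segmentor'))
    discriminator.load_weights(os.path.join(ckp_dir, 'checkpoint-discriminator'))
    print('Loaded pre-trained model from: {0}'.format(ckp_dir))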
Another consideration is that when you store both the discriminator and segmentor checkpoints in the same log directory, you overwrite the checkpoint file each time. This file contains two strings giving the paths to the specific model checkpoint files. Since you save the discriminator second, this file will always point to the discriminator with no reference to the segmentor. You can mitigate this by storing each model in its own subdirectory of the log directory, i.e.
logs_dir/
  + discriminator/
    + checkpoint
    + ...
  + segmentor/
    + checkpoint
    + ...
That said, in its current state your code would still work in this case.
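A minimal sketch of a save_model following this layout, reusing the utils.safe_mkdir helper from the question (load_pretrained_model would need the matching paths):

def save_model(ckp_dir):
    # One subdirectory per model, so their 'checkpoint' metadata files
    # do not overwrite each other.
    utils.safe_mkdir(os.path.join(ckp_dir, 'segmentor'))
    utils.safe_mkdir(os.path.join(ckp_dir, 'discriminator'))
    segmentor.save_weights(os.path.join(ckp_dir, 'segmentor', 'checkpoint-segmentor'))
    discriminator.save_weights(os.path.join(ckp_dir, 'discriminator', 'checkpoint-discriminator'))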
I'm writing a process-based implementation of A3C with TensorFlow in eager mode. After every gradient update, my master model writes its parameters as checkpoints to a folder. The workers then update their parameters by loading the latest checkpoint from this folder. However, there is a problem.
Often, while a worker is reading the latest available checkpoint from the folder, the master network writes new checkpoints to the folder and sometimes erases the checkpoint that the worker is reading. A simple solution would be to raise the maximum number of checkpoints to keep. However, tfe.Checkpoint and tfe.Saver don't have a parameter for choosing the max to keep.
Is there a way to achieve this?
For the tf.train.Saver you can specify max_to_keep:
tf.train.Saver(max_to_keep=10)
and max_to_keep seems to be present in both tfe.Saver and its underlying tf.train.Saver.
I haven't tried whether it works, though.
It seems the suggested way of doing checkpoint deletion is to use the CheckpointManager.
import tensorflow as tf
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.contrib.checkpoint.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=5)
status = checkpoint.restore(manager.latest_checkpoint)
while True:
    # train
    manager.save()
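Note that tf.contrib was removed in TensorFlow 2.x; there the same class lives under tf.train. A minimal sketch, assuming optimizer and model are already constructed:

checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)
manager = tf.train.CheckpointManager(
    checkpoint, directory="/tmp/model", max_to_keep=10)
checkpoint.restore(manager.latest_checkpoint)
for _ in range(num_train_steps):  # num_train_steps assumed to be defined elsewhere
    # ... run one training step here ...
    manager.save()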
We can save many checkpoints of a model using Estimator and RunConfig.
classifier.evaluate will use the latest checkpoint (step 200) by default; can I load ckpt-1 instead?
my_checkpointing_config = tf.estimator.RunConfig(
    save_checkpoints_secs=20*60,  # Save checkpoints every 20 minutes.
    keep_checkpoint_max=10,       # Retain the 10 most recent checkpoints.
)

classifier = tf.estimator.DNNClassifier(
    feature_columns=my_feature_columns,
    hidden_units=[10, 10],
    n_classes=3,
    model_dir='models/iris',
    config=my_checkpointing_config)
$ ls -1 models/iris
checkpoint
events.out.tfevents.timestamp.hostname
graph.pbtxt
model.ckpt-1.data-00000-of-00001
model.ckpt-1.index
model.ckpt-1.meta
model.ckpt-200.data-00000-of-00001
model.ckpt-200.index
model.ckpt-200.meta
Both tf.estimator.Estimator.evaluate and tf.estimator.Estimator.predict have a checkpoint_path argument. You should be able to supply the path to model.ckpt-1 here to use this checkpoint for evaluation.
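For example, a minimal sketch, assuming an eval_input_fn is already defined for the iris model above:

# Evaluate using the first checkpoint instead of the latest one
eval_result = classifier.evaluate(
    input_fn=eval_input_fn,  # assumed to be defined elsewhere
    checkpoint_path='models/iris/model.ckpt-1')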
Note that this argument was added in a fairly recent TF release (might be 1.7 or 1.8, not quite sure), so if you are using an outdated version you might not have it available. There is a hacky alternative: in the model_dir there should be a file called checkpoint. The first line of this file should be
model_checkpoint_path: "model.ckpt-xxxxxx"
where xxxxxx is the step number of the latest checkpoint (200 in your case). You can manually change this line to whatever checkpoint you want the Estimator to load. However, you will probably want to change it back afterwards, or you might run into issues if you ever want to continue training the model.
Is it possible to load only specific layers (the convolutional layers) from one checkpoint file?
I've trained some CNNs fully supervised and saved my progress (I'm doing object localization). To do auto-labelling, I thought of building a weakly-supervised CNN out of my current model... but since the weakly-supervised version has different fully-connected layers, I would like to select only the convolutional filters from my TensorFlow checkpoint file.
Of course I could manually save the weights of the corresponding layers, but since they're already included in TensorFlow's checkpoint file, I would like to extract them from there, in order to have a single storage file.
TensorFlow 2.1 has many different public facilities for loading checkpoints (model.save, Checkpoint, saved_model, etc.), but to the best of my knowledge, none of them has a filtering API. So let me suggest a snippet for hard cases, which uses tooling from the TF 2.1 internal development tests.
checkpoint_filename = '/path/to/our/weird/checkpoint.ckpt'
model = tf.keras.Model( ... )   # TF2.0 Model to initialize with the above checkpoint
variables_to_load = [ ... ]     # List of model weight names to update.

from tensorflow.python.training.checkpoint_utils import load_checkpoint, list_variables

reader = load_checkpoint(checkpoint_filename)
for w in model.weights:
    name = w.name.split(':')[0]  # See (b/29227106)
    if name in variables_to_load:
        print(f"Updating {name}")
        w.assign(reader.get_tensor(
            # (Optional) Handle variable renaming
            {'/var_name1/in/model': '/var_name1/in/checkpoint',
             '/var_name2/in/model': '/var_name2/in/checkpoint',
             # ... and so on
            }.get(name, name)))
Note: model.weights and list_variables may help to inspect the variables in the model and in the checkpoint, respectively.
Note also that this method will not restore the model's optimizer state.
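For instance, a quick way to inspect what the checkpoint actually contains before choosing variables_to_load:

# Print (name, shape) pairs for every variable stored in the checkpoint
for name, shape in list_variables(checkpoint_filename):
    print(name, shape)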