How to set the checkpoint for fine-tuning - tensorflow

I found that when I retrain a model (ssd_mobilenet_v2) from the model zoo, the loss is very large at the beginning of training, while the accuracy on the validation set is good. The training log is shown below.
The log can't be from the trained model, so I suspect the checkpoint is not being loaded for fine-tuning. Please help me understand how to fine-tune the trained model on the same dataset. I did not modify the network structure at all.
I set the checkpoint path in pipeline.config as below:
fine_tune_checkpoint:"//ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
If I set model_dir to the downloaded directory, it won't train because the global step restored from the checkpoint is already larger than max_step. If I enlarge max_step, I can see the log of parameters being restored from the checkpoint, but then training fails with an error that some parameters could not be restored.
So I set model_dir to an empty directory. Training then runs normally, but the loss at step 0 is very large and the validation results are very bad.
in pipeline.config
fine_tune_checkpoint: "/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
num_steps: 200000
fine_tune_checkpoint_type: "detection"
train script
model_dir = '/ssd_mobilenet_v2_coco_2018_03_29/retrain0524'
pipeline_config_path = '/ssd_mobilenet_v2_coco_2018_03_29/pipeline.config'
checkpoint_dir = '/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt'
num_train_steps = 300000
config = tf.estimator.RunConfig(model_dir=model_dir)
train_and_eval_dict = model_lib.create_estimator_and_inputs(
    run_config=config,
    hparams=model_hparams.create_hparams(hparams_overrides),
    pipeline_config_path=pipeline_config_path,
    sample_1_of_n_eval_examples=sample_1_of_n_eval_examples,
    sample_1_of_n_eval_on_train_examples=(sample_1_of_n_eval_on_train_examples))
estimator = train_and_eval_dict['estimator']
train_input_fn = train_and_eval_dict['train_input_fn']
eval_input_fns = train_and_eval_dict['eval_input_fns']
eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
predict_input_fn = train_and_eval_dict['predict_input_fn']
train_steps = train_and_eval_dict['train_steps']
train_spec, eval_specs = model_lib.create_train_and_eval_specs(
    train_input_fn,
    eval_input_fns,
    eval_on_train_input_fn,
    predict_input_fn,
    train_steps,
    eval_on_train_data=False)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
INFO:tensorflow:loss = 356.25497, step = 0
INFO:tensorflow:global_step/sec: 1.89768
INFO:tensorflow:loss = 11.221423, step = 100 (52.700 sec)
INFO:tensorflow:global_step/sec: 2.21685
INFO:tensorflow:loss = 10.329516, step = 200 (45.109 sec)

If the initial training loss is around 400, the model is most likely being restored from the checkpoint successfully, just not with all of its variables.
Here is the restore_map function of the SSD models. Note that even if you set fine_tune_checkpoint_type: detection, and even if you provide a checkpoint of exactly the same model, only the variables in the feature_extractor scope are restored. To restore as many variables from the checkpoint as possible, you will have to set load_all_detection_checkpoint_vars: true in your config file.
def restore_map(self,
                fine_tune_checkpoint_type='detection',
                load_all_detection_checkpoint_vars=False):
  if fine_tune_checkpoint_type not in ['detection', 'classification']:
    raise ValueError('Not supported fine_tune_checkpoint_type: {}'.format(
        fine_tune_checkpoint_type))
  if fine_tune_checkpoint_type == 'classification':
    return self._feature_extractor.restore_from_classification_checkpoint_fn(
        self._extract_features_scope)
  if fine_tune_checkpoint_type == 'detection':
    variables_to_restore = {}
    for variable in tf.global_variables():
      var_name = variable.op.name
      if load_all_detection_checkpoint_vars:
        variables_to_restore[var_name] = variable
      else:
        if var_name.startswith(self._extract_features_scope):
          variables_to_restore[var_name] = variable
    return variables_to_restore
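For reference, this is roughly what the relevant train_config block in pipeline.config would look like with that flag set (the checkpoint path is a placeholder taken from the question; only load_all_detection_checkpoint_vars is new compared to the question's config):
train_config {
  fine_tune_checkpoint: "/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
  fine_tune_checkpoint_type: "detection"
  load_all_detection_checkpoint_vars: true
  num_steps: 200000
}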

How to load the last checkpoint in TensorFlow?

I am practising with TensorFlow on this tutorial. The evaluation function depends on training having run first, so that it can load the latest checkpoint:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)

start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    ckpt.restore(ckpt_manager.latest_checkpoint)

for epoch in range(start_epoch, EPOCHS):
    start = time.time()
    total_loss = 0

    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss

        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {:.4f}'.format(
                epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))

    loss_plot.append(total_loss / num_steps)
    ckpt_manager.save()
Without ckpt_manager.save(), the evaluation function does not work.
When we have already trained a model and the checkpoints are available in checkpoint_path, how should we load the model without training?
You can use tf.train.latest_checkpoint to get the latest checkpoint file and then load it manually using ckpt.restore:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder,
                           decoder=decoder,
                           optimizer=optimizer)
ckpt_path = tf.train.latest_checkpoint(checkpoint_path)
ckpt.restore(ckpt_path)
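To double-check that the restore actually matched your objects, you can inspect the status object returned by ckpt.restore (a small sketch, not part of the original answer; the assertion is only meaningful once the encoder/decoder variables have been built, e.g. after calling the models once):
status = ckpt.restore(tf.train.latest_checkpoint(checkpoint_path))
# Raises if existing encoder/decoder/optimizer variables were not found in the checkpoint.
status.assert_existing_objects_matched()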

using Estimator interface for inference with pre-trained tensorflow object detection model

I'm trying to load a pre-trained TensorFlow object detection model from the TensorFlow Object Detection repo as a tf.estimator.Estimator and use it to make predictions.
I'm able to load the model and run inference using Estimator.predict(); however, the output is garbage. Other methods of loading the model, e.g. as a Predictor, and running inference work fine.
Any help with properly loading a model as an Estimator and calling predict() would be much appreciated. My current code:
Load and prepare image
def load_image_into_numpy_array(image):
    (im_width, im_height) = image.size
    return np.array(list(image.getdata())).reshape((im_height, im_width, 3)).astype(np.uint8)
image_url = 'https://i.imgur.com/rRHusZq.jpg'
# Load image
response = requests.get(image_url)
image = Image.open(BytesIO(response.content))
# Format original image size
im_size_orig = np.array(list(image.size) + [1])
im_size_orig = np.expand_dims(im_size_orig, axis=0)
im_size_orig = np.int32(im_size_orig)
# Resize image
image = image.resize((np.array(image.size) / 4).astype(int))
# Format image
image_np = load_image_into_numpy_array(image)
image_np_expanded = np.expand_dims(image_np, axis=0)
image_np_expanded = np.float32(image_np_expanded)
# Stick into feature dict
x = {'image': image_np_expanded, 'true_image_shape': im_size_orig}
# Stick into input function
predict_input_fn = tf.estimator.inputs.numpy_input_fn(
    x=x,
    y=None,
    shuffle=False,
    batch_size=128,
    queue_capacity=1000,
    num_epochs=1,
    num_threads=1,
)
Side note:
train_and_eval_dict also seems to contain an input_fn for prediction
train_and_eval_dict['predict_input_fn']
However, calling it returns a tf.estimator.export.ServingInputReceiver, which I'm not sure what to do with. This could potentially be the source of my problems, as there's a fair bit of pre-processing involved before the model actually sees the image.
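As an aside (not from the original post): a function that returns a ServingInputReceiver is normally meant for exporting the Estimator rather than for Estimator.predict(). A rough sketch, assuming train_and_eval_dict['predict_input_fn'] is such a serving input function and './exported_model' is a hypothetical output directory:
from tensorflow.contrib import predictor

# Export a SavedModel using the serving input function, then run inference
# through the predictor API (which includes the exported pre-processing graph).
export_dir = estimator.export_savedmodel(
    './exported_model',
    train_and_eval_dict['predict_input_fn'])
# export_dir may be returned as bytes depending on the TF version.
predict_fn = predictor.from_saved_model(
    export_dir.decode() if isinstance(export_dir, bytes) else export_dir)
# Inspect predict_fn.feed_tensors to see which input keys the SavedModel expects.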
Load model as Estimator
Model downloaded from TF Model Zoo here, code to load model adapted from here.
model_dir = './pretrained_models/tensorflow/ssd_mobilenet_v1_coco_2018_01_28/'
pipeline_config_path = os.path.join(model_dir, 'pipeline.config')
config = tf.estimator.RunConfig(model_dir=model_dir)
train_and_eval_dict = model_lib.create_estimator_and_inputs(
    run_config=config,
    hparams=model_hparams.create_hparams(None),
    pipeline_config_path=pipeline_config_path,
    train_steps=None,
    sample_1_of_n_eval_examples=1,
    sample_1_of_n_eval_on_train_examples=(5))
estimator = train_and_eval_dict['estimator']
Run inference
output_dict1 = estimator.predict(predict_input_fn)
This prints out some log messages, one of which is:
INFO:tensorflow:Restoring parameters from ./pretrained_models/tensorflow/ssd_mobilenet_v1_coco_2018_01_28/model.ckpt
So it seems like the pre-trained weights are getting loaded. However, the resulting detections are garbage (output image not reproduced here).
Load same model as a Predictor
from tensorflow.contrib import predictor
model_dir = './pretrained_models/tensorflow/ssd_mobilenet_v1_coco_2018_01_28'
saved_model_dir = os.path.join(model_dir, 'saved_model')
predict_fn = predictor.from_saved_model(saved_model_dir)
Run inference
output_dict2 = predict_fn({'inputs': image_np_expanded})
The results look reasonable (output image not reproduced here).
When you load the model as an Estimator from a checkpoint file, here is the restore function associated with the SSD models, from ssd_meta_arch.py:
def restore_map(self,
                fine_tune_checkpoint_type='detection',
                load_all_detection_checkpoint_vars=False):
  """Returns a map of variables to load from a foreign checkpoint.

  See parent class for details.

  Args:
    fine_tune_checkpoint_type: whether to restore from a full detection
      checkpoint (with compatible variable names) or to restore from a
      classification checkpoint for initialization prior to training.
      Valid values: `detection`, `classification`. Default 'detection'.
    load_all_detection_checkpoint_vars: whether to load all variables (when
      `fine_tune_checkpoint_type='detection'`). If False, only variables
      within the appropriate scopes are included. Default False.

  Returns:
    A dict mapping variable names (to load from a checkpoint) to variables in
    the model graph.

  Raises:
    ValueError: if fine_tune_checkpoint_type is neither `classification`
      nor `detection`.
  """
  if fine_tune_checkpoint_type not in ['detection', 'classification']:
    raise ValueError('Not supported fine_tune_checkpoint_type: {}'.format(
        fine_tune_checkpoint_type))
  if fine_tune_checkpoint_type == 'classification':
    return self._feature_extractor.restore_from_classification_checkpoint_fn(
        self._extract_features_scope)
  if fine_tune_checkpoint_type == 'detection':
    variables_to_restore = {}
    for variable in tf.global_variables():
      var_name = variable.op.name
      if load_all_detection_checkpoint_vars:
        variables_to_restore[var_name] = variable
      else:
        if var_name.startswith(self._extract_features_scope):
          variables_to_restore[var_name] = variable
    return variables_to_restore
As you can see, even if the config file sets from_detection_checkpoint: True, only the variables in the feature extractor scope will be restored. To restore all the variables, you will have to set
load_all_detection_checkpoint_vars: True
in the config file.
So the situation above is quite clear: when the model is loaded as an Estimator, only the variables in the feature extractor scope are restored. The weights in the predictor scopes are not restored, so the Estimator obviously gives essentially random predictions.
When the model is loaded as a Predictor, all weights are loaded, so the predictions are reasonable.
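If you want to confirm this yourself, here is a small sketch (not from the original answer) that lists the variables stored in the downloaded checkpoint, so you can compare their names against what restore_map returns; tf.train.list_variables is the standard way to inspect a checkpoint:
import tensorflow as tf

ckpt_path = './pretrained_models/tensorflow/ssd_mobilenet_v1_coco_2018_01_28/model.ckpt'
# Each entry is a (variable_name, shape) pair stored in the checkpoint file.
for name, shape in tf.train.list_variables(ckpt_path):
    print(name, shape)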

Using tf.set_random_seed with tf.estimator.Estimator

I am using tf.estimator.Estimator to manage the training and testing parts of my code. I am tuning some hyperparameters, so I need to make sure that the weights are initialized with the same random seed. Is there any way to set a random seed for a session created by tf.estimator?
You should define the random seed in the configuration passed to the estimator:
seed = 2018
config = tf.estimator.RunConfig(model_dir=model_dir, tf_random_seed=seed)
estimator = tf.estimator.Estimator(model_fn, config=config, params=params)
Here is the documentation for RunConfig.
One thing to be careful about is that each time you run estimator.train(train_input_fn), a new graph is created to train the model (by calling train_input_fn to create the input pipeline and calling model_fn on the output of train_input_fn).
One issue is that this new graph will also be set with the same random seed each time.
Example
Let me explain with an example. Suppose you perform data augmentation in your input pipeline and you evaluate your model every epoch. This would give you something like this:
def train_input_fn():
    features = tf.random_uniform([])
    labels = tf.random_uniform([])
    dataset = tf.data.Dataset.from_tensors((features, labels))
    return dataset

def model_fn(features, labels, mode, params):
    loss = features
    global_step = tf.train.get_global_step()
    train_op = global_step.assign_add(1)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

seed = 2018
config = tf.estimator.RunConfig(model_dir="test", tf_random_seed=seed)
estimator = tf.estimator.Estimator(model_fn, config=config)

num_epochs = 10
for epoch in range(num_epochs):
    estimator.train(train_input_fn, steps=1)
    estimator.evaluate(train_input_fn, steps=1)
The input function creates random features (and labels). What happens is that the features created are going to be exactly the same at each epoch. The output will be like:
INFO:tensorflow:loss = 0.17983198, step = 1
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.006007552
INFO:tensorflow:loss = 0.17983198, step = 2
INFO:tensorflow:Saving dict for global step 2: global_step = 2, loss = 0.006007552
INFO:tensorflow:loss = 0.17983198, step = 3
INFO:tensorflow:Saving dict for global step 3: global_step = 3, loss = 0.006007552
...
You can see that the loss (equal to the input features) is the same at each epoch, which means the same random seed is used at each epoch.
This is an issue if you want to evaluate at each epoch and perform data augmentation, because you will end up with the same data augmentation at each epoch.
Solution
One quick solution is to remove the random seed. However, this prevents you from running reproducible experiments.
A better solution is to create a new estimator at each epoch with the same model_fn but a different random seed:
seed = 2018
num_epochs = 10
for epoch in range(num_epochs):
    config = tf.estimator.RunConfig(model_dir="test", tf_random_seed=seed + epoch)
    estimator = tf.estimator.Estimator(model_fn, config=config)
    estimator.train(train_input_fn, steps=1)
    estimator.evaluate(train_input_fn, steps=1)
The features will change correctly at each epoch:
INFO:tensorflow:loss = 0.17983198, step = 1
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.006007552
INFO:tensorflow:loss = 0.22154999, step = 2
INFO:tensorflow:Saving dict for global step 2: global_step = 2, loss = 0.70446754
INFO:tensorflow:loss = 0.48594844, step = 3

Keras sequential model to Tensorflow EstimatorSpec accuracy decreases

I'm having some issues converting a Keras model (keras_model_fn) to a TensorFlow model_fn for use in SageMaker.
The models look like this:
Keras
def keras_model_fn(hyperparameters):
    model = tf.keras.Sequential()
    # increase input_dim (cur 2500) as amount of words go up
    model.add(tf.keras.layers.InputLayer(input_shape=[8], name='main_input'))
    model.add(tf.keras.layers.Embedding(2500, 128, input_length=8))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'))
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['acc']
    )
    return model
TensorFlow
def model_fn(features, labels, mode, params):
    input_layer = tf.keras.layers.InputLayer(
        input_shape=(8,))(features[INPUT_TENSOR_NAME])
    embedding_layer = tf.keras.layers.Embedding(
        2500,
        128,
        input_length=8)(input_layer)
    flattened = tf.keras.layers.Flatten()(embedding_layer)
    predictions = tf.keras.layers.Dense(
        NUM_CLASSES,
        activation='softmax')(flattened)

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"output": predictions})

    loss = tf.losses.softmax_cross_entropy(labels, predictions)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.train.get_global_step(),
        learning_rate=0.001,
        optimizer="Adam")

    predictions_dict = {"output": predictions}
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(
            tf.cast(labels, tf.int32), predictions)
    }

    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops
    )
The training and eval data are identical: an array of padded text sequences (length 8), with an expected output of one of 5 labels.
The Losses
I'm assuming the problem lies in the loss function. I can't quite figure out what the Sequential model is doing behind the scenes versus what my TensorFlow model is doing.
In the Keras model, I'm getting the following loss.
INFO:tensorflow:global_step/sec: 170.783
INFO:tensorflow:loss = 0.0018957269, step = 1701 (0.586 sec)
INFO:tensorflow:global_step/sec: 164.419
INFO:tensorflow:loss = 0.029586311, step = 1801 (0.608 sec)
INFO:tensorflow:global_step/sec: 155.381
INFO:tensorflow:loss = 0.0019212833, step = 1901 (0.644 sec)
INFO:tensorflow:Loss for final step: 0.0023477676.
In the Converted model, I'm getting the following.
INFO:tensorflow:loss = 1.232958, step = 1701 (0.354 sec)
INFO:tensorflow:global_step/sec: 280.328
INFO:tensorflow:loss = 1.0923336, step = 1801 (0.357 sec)
INFO:tensorflow:global_step/sec: 291.823
INFO:tensorflow:loss = 1.4360821, step = 1901 (0.343 sec)
INFO:tensorflow:Loss for final step: 1.0532712.
As these losses suggest, the accuracy of the converted model (on the data it was trained on) only hits around 60%, while the accuracy of the Keras model is at 100%.
My question here is does everything look right in the conversion? What could I be doing different with the converted model to get similar performance?
I've started to dig around in the Keras source code to see what the model compile function is doing with targets/outputs, but was going to reach out here as well to see if anyone has a suggestion/ran into this before.
The problem is probably that you're applying two softmax activations in the TensorFlow version. Note that tf.losses.softmax_cross_entropy expects unscaled logits. You could do the following:
logits = tf.keras.layers.Dense(
    NUM_CLASSES)(flattened)
predictions = tf.keras.layers.Activation(
    'softmax')(logits)

loss = tf.losses.softmax_cross_entropy(labels, logits)
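For context, here is a sketch (not from the original answer) of how these pieces would slot back into the model_fn from the question; NUM_CLASSES, flattened, and labels refer to the question's code, and the accuracy metric comparing class indices is my assumption about the intended evaluation, given one-hot labels:
logits = tf.keras.layers.Dense(NUM_CLASSES)(flattened)        # unscaled logits
predictions = tf.keras.layers.Activation('softmax')(logits)   # probabilities, for PREDICT mode

# The loss is computed on the logits, not on the softmax output.
loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

# Compare class indices rather than raw one-hot / probability vectors.
eval_metric_ops = {
    "accuracy": tf.metrics.accuracy(
        labels=tf.argmax(labels, axis=1),
        predictions=tf.argmax(logits, axis=1))
}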

Trouble restoring checkpointed TensorFlow net

I have built an autoencoder to "convert" the activations of VGG19.relu4_1 into pixels. I use the new convenience functions in tensorflow.contrib.layers (as of TF 0.10rc0). The code has a similar layout to TensorFlow's CIFAR10 tutorial, with a train.py that does the training and checkpoints the model to disk, and an eval.py that polls for new checkpoint files and runs inference on them.
My problem is that the evaluation is never as good as the training, neither in terms of the value of the loss function nor when I look at the output images (even when running on the same images as the training does). This makes me think it has something to do with the restore process.
When I look at the output from the training in TensorBoard it looks good (eventually) so I don't think there is anything wrong with my net per se.
My net looks like this:
import tensorflow.contrib.layers as contrib
bn_params = {
    "is_training": is_training,
    "center": True,
    "scale": True
}

tensor = contrib.convolution2d_transpose(vgg_output, 64*4, 4,
                                         stride=2,
                                         normalizer_fn=contrib.batch_norm,
                                         normalizer_params=bn_params,
                                         scope="deconv1")
tensor = contrib.convolution2d_transpose(tensor, 64*2, 4,
                                         stride=2,
                                         normalizer_fn=contrib.batch_norm,
                                         normalizer_params=bn_params,
                                         scope="deconv2")
.
.
.
And in train.py I do this to save the checkpoint:
variable_averages = tf.train.ExponentialMovingAverage(mynet.MOVING_AVERAGE_DECAY)
variables_averages_op = variable_averages.apply(tf.trainable_variables())
with tf.control_dependencies([apply_gradient_op, variables_averages_op]):
    train_op = tf.no_op(name='train')

while training:
    # train (with batch normalization's is_training = True)
    if time_to_checkpoint:
        saver.save(sess, checkpoint_path, global_step=step)
In eval.py I do this:
# run code that creates the net
variable_averages = tf.train.ExponentialMovingAverage(
    mynet.MOVING_AVERAGE_DECAY)
saver = tf.train.Saver(variable_averages.variables_to_restore())

while polling:
    # sleep and check for new checkpoint files
    with tf.Session() as sess:
        init = tf.initialize_all_variables()
        init_local = tf.initialize_local_variables()
        sess.run([init, init_local])
        saver.restore(sess, checkpoint_path)
        # run inference (with batch normalization's is_training = False)
In the loss plot (not reproduced here), the blue curve is the training loss and the orange curve is the eval loss.
The problem was that I used tf.train.AdamOptimizer() directly. During optimization it never ran the update operations that contrib.batch_norm registers to track the running mean/variance of its inputs, so the mean/variance stayed at 0.0/1.0.
The solution is to add a dependency on the tf.GraphKeys.UPDATE_OPS collection before the optimizer step. There is already a function in the contrib module that does this for you: optimize_loss().
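If you want to keep using tf.train.AdamOptimizer() directly instead of switching to optimize_loss(), this is a minimal sketch of the usual pattern (loss refers to whatever the training script already defines):
# Make the optimizer step depend on the batch-norm update ops so the
# running mean/variance are actually updated during training.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    apply_gradient_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())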