Step is always 0 when using tf.estimator.Estimator - tensorflow

I have been trying to learn the layers and estimators framework that was recently moved from contrib into the main API, and I ran into a rather strange problem. I wrote a simple autoencoder for MNIST, but during training the log keeps reporting step 0 even though the loss value is decreasing, so I assume the model is being trained. Since it is not counting steps, it is not saving checkpoints or summaries either. I am not sure what I am doing wrong: all the docs point to the old "tf.contrib.learn" framework, and many of the APIs there seem to be marked as deprecated. How do I make this work? Here is my code:
def encoder(x):
    l1 = tf.layers.dense(x, 256, activation=tf.nn.relu, name='encode1')
    l2 = tf.layers.dense(l1, 128, activation=tf.nn.relu, name='encode2')
    return l2

def decoder(x):
    l1 = tf.layers.dense(x, 256, activation=tf.nn.relu, name='decode1')
    l2 = tf.layers.dense(l1, 784, activation=tf.nn.relu, name='decode2')
    return l2

def loss(labels, preds):
    return tf.losses.huber_loss(labels, preds)

def train(loss):
    optimizer = tf.train.AdamOptimizer()
    return optimizer.minimize(loss)

def model_fn(features, labels, mode):
    _encoder = encoder(features)
    _decoder = decoder(_encoder)
    _loss = loss(labels, _decoder)
    _train = train(_loss)
    return tf.estimator.EstimatorSpec(mode=mode,
                                      predictions=_decoder,
                                      loss=_loss,
                                      train_op=_train)

data = input_data.read_data_sets(".", one_hot=True)
display.clear_output()
# remove current log dir
shutil.rmtree('logs', ignore_errors=True)

def input_fn():
    if data.train.epochs_completed <= 10:
        features, labels = data.train.next_batch(100)
        return tf.constant(features), tf.constant(features)
    raise StopIteration

estimator = tf.estimator.Estimator(model_fn, "logs")
estimator.train(input_fn=input_fn)
And here is some sample output:
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'logs', '_tf_random_seed': 1, '_save_summary_steps': 100, '_save_checkpoints_secs': 600, '_save_checkpoints_steps': None, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 0 into logs/model.ckpt.
INFO:tensorflow:loss = 0.0505481, step = 0
INFO:tensorflow:loss = 0.00319921, step = 0 (1.125 sec)
INFO:tensorflow:loss = 0.00277268, step = 0 (1.094 sec)
INFO:tensorflow:loss = 0.00275822, step = 0 (1.106 sec)
INFO:tensorflow:loss = 0.00275116, step = 0 (1.069 sec)
INFO:tensorflow:loss = 0.00275018, step = 0 (1.130 sec)
INFO:tensorflow:loss = 0.00274921, step = 0 (1.161 sec)
INFO:tensorflow:loss = 0.00274908, step = 0 (1.140 sec)
INFO:tensorflow:loss = 0.00274683, step = 0 (1.105 sec)
INFO:tensorflow:loss = 0.00274397, step = 0 (1.111 sec)

In the training op you need to pass the global_step argument. It is the step counter, and the optimizer increments it by one each time the training op runs; without it the counter stays at 0. So change the call to:
optimizer.minimize(loss, global_step=tf.train.get_global_step())
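To see why the loss can fall while the logged step stays at 0, here is a TensorFlow-free sketch of the bookkeeping only. ToyCounter and ToyOptimizer are hypothetical stand-ins, not TensorFlow classes; they mimic the documented behaviour that minimize increments global_step when one is passed and leaves it untouched otherwise.

```python
class ToyCounter:
    """Hypothetical stand-in for the global step variable."""

    def __init__(self):
        self.value = 0


class ToyOptimizer:
    """Hypothetical stand-in that mimics only the step bookkeeping."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate

    def minimize(self, weight, gradient, global_step=None):
        # The weight update always happens, so the loss falls either way.
        new_weight = weight - self.learning_rate * gradient
        # The counter is only touched when the caller passes one in.
        if global_step is not None:
            global_step.value += 1
        return new_weight


def run(steps, pass_global_step):
    counter = ToyCounter()
    optimizer = ToyOptimizer()
    weight = 1.0  # toy problem: minimise loss = weight ** 2
    for _ in range(steps):
        gradient = 2.0 * weight
        weight = optimizer.minimize(
            weight, gradient,
            global_step=counter if pass_global_step else None)
    return weight ** 2, counter.value


loss_without, step_without = run(10, pass_global_step=False)
loss_with, step_with = run(10, pass_global_step=True)
print("without global_step: loss=%.6f step=%d" % (loss_without, step_without))
print("with global_step:    loss=%.6f step=%d" % (loss_with, step_with))
```

Both runs reach the same loss; only the run that passes the counter ends with a non-zero step, which is the value that checkpoint and summary hooks key on.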

Related

How to set the checkpoint for fine-tuning

When I retrain a model (ssd_mobilenetv2) from the model zoo, the loss is very large at the beginning of training, even though the accuracy on the validation set is good (see the training log below).
This log cannot be from the trained model, so I suspect the checkpoint is not being loaded for fine-tuning. How do I fine-tune the trained model on the same dataset? I did not modify the network structure at all.
I set the checkpoint path in pipeline.config as follows:
fine_tune_checkpoint:"//ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
If I set model_dir to the directory I downloaded, training does not run because the global step is already larger than max_step. When I increase max_step, the log shows the parameters being restored from the checkpoint, but then it fails with an error saying that some parameters could not be restored.
So I set model_dir to an empty directory. Training then runs normally, but the loss at step 0 is very large and the validation result is very bad.
In pipeline.config:
fine_tune_checkpoint: "/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt"
num_steps: 200000
fine_tune_checkpoint_type: "detection"
Train script:
model_dir = '/ssd_mobilenet_v2_coco_2018_03_29/retrain0524'
pipeline_config_path = '/ssd_mobilenet_v2_coco_2018_03_29/pipeline.config'
checkpoint_dir = '/ssd_mobilenet_v2_coco_2018_03_29/model.ckpt'
num_train_steps = 300000
config = tf.estimator.RunConfig(model_dir=model_dir)
train_and_eval_dict = model_lib.create_estimator_and_inputs(
    run_config=config,
    hparams=model_hparams.create_hparams(hparams_overrides),
    pipeline_config_path=pipeline_config_path,
    sample_1_of_n_eval_examples=sample_1_of_n_eval_examples,
    sample_1_of_n_eval_on_train_examples=(sample_1_of_n_eval_on_train_examples))
estimator = train_and_eval_dict['estimator']
train_input_fn = train_and_eval_dict['train_input_fn']
eval_input_fns = train_and_eval_dict['eval_input_fns']
eval_on_train_input_fn = train_and_eval_dict['eval_on_train_input_fn']
predict_input_fn = train_and_eval_dict['predict_input_fn']
train_steps = train_and_eval_dict['train_steps']
train_spec, eval_specs = model_lib.create_train_and_eval_specs(
    train_input_fn,
    eval_input_fns,
    eval_on_train_input_fn,
    predict_input_fn,
    train_steps,
    eval_on_train_data=False)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
INFO:tensorflow:loss = 356.25497, step = 0
INFO:tensorflow:global_step/sec: 1.89768
INFO:tensorflow:loss = 11.221423, step = 100 (52.700 sec)
INFO:tensorflow:global_step/sec: 2.21685
INFO:tensorflow:loss = 10.329516, step = 200 (45.109 sec)
If the initial training loss is around 400, the model has most likely been restored from the checkpoint successfully, just not with all of the checkpoint's variables.
Here is the restore_map function of the SSD models. Note that even if you set fine_tune_checkpoint_type: "detection" and provide a checkpoint of exactly the same model, only the variables in the feature_extractor scope are restored by default. To restore as many variables from the checkpoint as possible, you have to set load_all_detection_checkpoint_vars: true in your config file.
def restore_map(self,
                fine_tune_checkpoint_type='detection',
                load_all_detection_checkpoint_vars=False):
    if fine_tune_checkpoint_type not in ['detection', 'classification']:
        raise ValueError('Not supported fine_tune_checkpoint_type: {}'.format(
            fine_tune_checkpoint_type))
    if fine_tune_checkpoint_type == 'classification':
        return self._feature_extractor.restore_from_classification_checkpoint_fn(
            self._extract_features_scope)
    if fine_tune_checkpoint_type == 'detection':
        variables_to_restore = {}
        for variable in tf.global_variables():
            var_name = variable.op.name
            if load_all_detection_checkpoint_vars:
                variables_to_restore[var_name] = variable
            else:
                if var_name.startswith(self._extract_features_scope):
                    variables_to_restore[var_name] = variable
    return variables_to_restore
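
The selection logic in restore_map can be tried out without TensorFlow. The sketch below reimplements only the name filter; select_variables_to_restore and the variable names are hypothetical and chosen for illustration, so the real graph will contain different names.

```python
def select_variables_to_restore(variable_names, feature_extractor_scope,
                                load_all_detection_checkpoint_vars=False):
    """Return the variable names that would be looked up in the checkpoint."""
    if load_all_detection_checkpoint_vars:
        # Every variable in the graph is requested from the checkpoint.
        return list(variable_names)
    # Default: only variables under the feature extractor scope are requested.
    return [name for name in variable_names
            if name.startswith(feature_extractor_scope)]


# Hypothetical variable names, for illustration only.
graph_variables = [
    "FeatureExtractor/Conv/weights",
    "FeatureExtractor/Conv_1/weights",
    "BoxPredictor_0/ClassPredictor/weights",
    "BoxPredictor_0/BoxEncodingPredictor/weights",
]

default_selection = select_variables_to_restore(
    graph_variables, "FeatureExtractor")
full_selection = select_variables_to_restore(
    graph_variables, "FeatureExtractor",
    load_all_detection_checkpoint_vars=True)

print("default:", default_selection)
print("load all:", full_selection)
```

With the default flag, the predictor heads in this toy list are left out and would start from their fresh initialisation, which would be consistent with a large loss at step 0 that drops quickly.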

Learning rate of Adam optimizer does not change

I want to log the learning rate of the Adam optimizer when using a TensorFlow estimator, like this:
def model_fn(features, labels, mode):
    ...
    optimizer = tf.train.AdamOptimizer(learning_rate=0.1)
    log_hook = tf.train.LoggingTensorHook({"lr" : optimizer._lr_t}, every_n_iter=10)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op, training_hooks=[log_hook])
...
As I understand it, the learning rate of tf.train.AdamOptimizer decays by itself. But the logged value is always 0.1, like this:
INFO:tensorflow:lr = 0.1 (4.537 sec)
INFO:tensorflow:global_step/sec: 2.18827
INFO:tensorflow:loss = 8.285036e-07, step = 16180 (4.570 sec)
INFO:tensorflow:lr = 0.1 (4.570 sec)
INFO:tensorflow:global_step/sec: 2.21156
INFO:tensorflow:loss = 8.225431e-07, step = 16190 (4.521 sec)
INFO:tensorflow:lr = 0.1 (4.521 sec)
Is this the right way to log the learning rate of AdamOptimizer?
Update:
Following this answer, I logged optimizer._lr instead, but got this error:
ValueError: Passed 0.1 should have graph attribute that is equal to current graph <tensorflow.python.framework.ops.Graph object at 0x7f96a290a350>.
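
One possible reading of the log above: the learning_rate passed to Adam is a constant hyperparameter, so a tensor holding it would be expected to print 0.1 at every step. What varies with the step t is the bias-corrected step size, which the tf.train.AdamOptimizer documentation gives as lr_t = learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t). The sketch below evaluates that formula in plain Python; adam_effective_lr is a hypothetical helper name, and the beta values are the optimizer's documented defaults.

```python
import math


def adam_effective_lr(learning_rate, step, beta1=0.9, beta2=0.999):
    """Bias-corrected step size lr_t for a 1-based step count."""
    return (learning_rate * math.sqrt(1.0 - beta2 ** step)
            / (1.0 - beta1 ** step))


base_lr = 0.1
for step in (1, 10, 100, 1000, 16180):
    print("step %6d  lr_t = %.6f" % (step, adam_effective_lr(base_lr, step)))
```

Under these assumptions lr_t starts below the configured value and approaches it as t grows, so by the steps shown in the log it is practically equal to 0.1. The per-parameter update is additionally scaled by the running moment estimates, which this sketch does not model.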

Using tf.set_random_seed with tf.estimator.Estimator

I am using tf.estimator.Estimator to manage the training and testing parts of my code. I am tuning some hyperparameters, so I need to make sure that the weights are initialized with the same random seed. Is there any way to call set_random_seed for a session created by tf.estimator?
You should define the random seed in the configuration passed to the estimator:
seed = 2018
config = tf.estimator.RunConfig(model_dir=model_dir, tf_random_seed=seed)
estimator = tf.estimator.Estimator(model_fn, config=config, params=params)
Here is the documentation for RunConfig.
One thing to be careful about is that each time you run estimator.train(train_input_fn), a new graph is created to train the model (by calling train_input_fn to create the input pipeline and calling model_fn on the output of train_input_fn).
One issue is that this new graph will also be set with the same random seed each time.
Example
Let me explain with an example. Suppose you perform data augmentation in your input pipeline, and you evaluate your model every epoch. This would give you something like this:
def train_input_fn():
    features = tf.random_uniform([])
    labels = tf.random_uniform([])
    dataset = tf.data.Dataset.from_tensors((features, labels))
    return dataset

def model_fn(features, labels, mode, params):
    loss = features
    global_step = tf.train.get_global_step()
    train_op = global_step.assign_add(1)
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

seed = 2018
config = tf.estimator.RunConfig(model_dir="test", tf_random_seed=seed)
estimator = tf.estimator.Estimator(model_fn, config=config)
num_epochs = 10
for epoch in range(num_epochs):
    estimator.train(train_input_fn, steps=1)
    estimator.evaluate(train_input_fn, steps=1)
The input function creates random features (and labels). What happens is that the features created are going to be exactly the same at each epoch. The output will be like:
INFO:tensorflow:loss = 0.17983198, step = 1
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.006007552
INFO:tensorflow:loss = 0.17983198, step = 2
INFO:tensorflow:Saving dict for global step 2: global_step = 2, loss = 0.006007552
INFO:tensorflow:loss = 0.17983198, step = 3
INFO:tensorflow:Saving dict for global step 3: global_step = 3, loss = 0.006007552
...
You can see that the loss (equal to the input features) is the same at each epoch, which means the same random seed is used at each epoch.
This is an issue if you want to evaluate at each epoch and perform data augmentation, because you will end up with the same data augmentation at each epoch.
Solution
One quick solution is to remove the random seed. However, this prevents you from running reproducible experiments.
A better solution is to create a new estimator at each epoch with the same model_fn but a different random seed:
seed = 2018
num_epochs = 10
for epoch in range(num_epochs):
    config = tf.estimator.RunConfig(model_dir="test", tf_random_seed=seed + epoch)
    estimator = tf.estimator.Estimator(model_fn, config=config)
    estimator.train(train_input_fn, steps=1)
    estimator.evaluate(train_input_fn, steps=1)
The features will change correctly at each epoch:
INFO:tensorflow:loss = 0.17983198, step = 1
INFO:tensorflow:Saving dict for global step 1: global_step = 1, loss = 0.006007552
INFO:tensorflow:loss = 0.22154999, step = 2
INFO:tensorflow:Saving dict for global step 2: global_step = 2, loss = 0.70446754
INFO:tensorflow:loss = 0.48594844, step = 3
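
The same effect can be imitated without TensorFlow. The sketch below is only an analogy that uses Python's random module: draw_features is a hypothetical stand-in for "build a fresh graph with a given seed and draw one random value", which is roughly what each train call does.

```python
import random


def draw_features(seed):
    """Stand-in for building a fresh, seeded graph and drawing one value."""
    rng = random.Random(seed)  # new generator each call, like a new graph
    return rng.random()


base_seed = 2018
num_epochs = 3

# Same seed every epoch: every "graph" produces the same draw.
same_seed_draws = [draw_features(base_seed) for epoch in range(num_epochs)]

# Seed offset by the epoch: still reproducible, but different per epoch.
varied_seed_draws = [draw_features(base_seed + epoch)
                     for epoch in range(num_epochs)]

print("fixed seed: ", same_seed_draws)
print("seed + epoch:", varied_seed_draws)
```

Rerunning the script reproduces both lists exactly, so offsetting the seed by the epoch keeps the experiment reproducible while still varying the augmentation from epoch to epoch.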

Keras sequential model to Tensorflow EstimatorSpec accuracy decreases

I am having some issues converting a Keras model (keras_model_fn) to a TensorFlow model_fn for use in SageMaker.
The models look like this:
Keras
def keras_model_fn(hyperparameters):
    model = tf.keras.Sequential()
    # increase input_dim (cur 2500) as amount of words go up
    model.add(tf.keras.layers.InputLayer(input_shape=[8], name='main_input'))
    model.add(tf.keras.layers.Embedding(2500, 128, input_length=8))
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(NUM_CLASSES, activation='softmax'))
    model.compile(
        optimizer='adam',
        loss='categorical_crossentropy',
        metrics=['acc']
    )
    return model
Tensorflow
def model_fn(features, labels, mode, params):
    input_layer = tf.keras.layers.InputLayer(
        input_shape=(8,))(features[INPUT_TENSOR_NAME])
    embedding_layer = tf.keras.layers.Embedding(
        2500,
        128,
        input_length=8)(input_layer)
    flattened = tf.keras.layers.Flatten()(embedding_layer)
    predictions = tf.keras.layers.Dense(
        NUM_CLASSES,
        activation='softmax')(flattened)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
            mode=mode,
            predictions={"output": predictions})
    loss = tf.losses.softmax_cross_entropy(labels, predictions)
    train_op = tf.contrib.layers.optimize_loss(
        loss=loss,
        global_step=tf.train.get_global_step(),
        learning_rate=0.001,
        optimizer="Adam")
    predictions_dict = {"output": predictions}
    eval_metric_ops = {
        "accuracy": tf.metrics.accuracy(
            tf.cast(labels,tf.int32), predictions)
    }
    return tf.estimator.EstimatorSpec(
        mode=mode,
        loss=loss,
        train_op=train_op,
        eval_metric_ops=eval_metric_ops
    )
The training and eval data are identical: I feed in an array of padded text sequences (length 8), and the expected output is one of five labels.
The Losses
I'm assuming the problem lies in the loss function. I can't quite figure out what the Sequential model is doing behind the scenes versus what my tensorflow model is doing.
In the Keras model, I'm getting the following loss.
INFO:tensorflow:global_step/sec: 170.783
INFO:tensorflow:loss = 0.0018957269, step = 1701 (0.586 sec)
INFO:tensorflow:global_step/sec: 164.419
INFO:tensorflow:loss = 0.029586311, step = 1801 (0.608 sec)
INFO:tensorflow:global_step/sec: 155.381
INFO:tensorflow:loss = 0.0019212833, step = 1901 (0.644 sec)
INFO:tensorflow:Loss for final step: 0.0023477676.
In the Converted model, I'm getting the following.
INFO:tensorflow:loss = 1.232958, step = 1701 (0.354 sec)
INFO:tensorflow:global_step/sec: 280.328
INFO:tensorflow:loss = 1.0923336, step = 1801 (0.357 sec)
INFO:tensorflow:global_step/sec: 291.823
INFO:tensorflow:loss = 1.4360821, step = 1901 (0.343 sec)
INFO:tensorflow:Loss for final step: 1.0532712.
As expected the accuracy on the Converted model (for the data it was trained on) hits around 60%. The accuracy for the Keras model is at 100%.
My question here is does everything look right in the conversion? What could I be doing different with the converted model to get similar performance?
I've started to dig around in the Keras source code to see what the model compile function is doing with targets/outputs, but was going to reach out here as well to see if anyone has a suggestion/ran into this before.
The problem is probably that you're applying softmax twice in the TensorFlow version: once as the activation of the last Dense layer and once more inside the loss. Note that tf.losses.softmax_cross_entropy expects unscaled logits. You could do the following:
logits = tf.keras.layers.Dense(
    NUM_CLASSES)(flattened)
predictions = tf.keras.layers.Activation(
    'softmax')(logits)
loss = tf.losses.softmax_cross_entropy(labels, logits)
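
To illustrate why a second softmax would hurt, here is a plain-Python sketch (no TensorFlow) of cross-entropy computed from logits. NUM_CLASSES = 5 is an assumption taken from the "one of five labels" in the question, and the helper names are hypothetical. When probabilities in [0, 1] are passed where logits are expected, even a perfect prediction cannot push the loss below ln(1 + 4/e), which is about 0.905.

```python
import math

NUM_CLASSES = 5  # assumption: the question mentions one of five labels


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_from_logits(one_hot_label, logits):
    """Cross-entropy that applies softmax internally, as a logits-based loss does."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot_label, probs))


label = [1.0] + [0.0] * (NUM_CLASSES - 1)

# Correct usage: raw logits go into the loss, so it can approach zero.
confident_logits = [20.0] + [0.0] * (NUM_CLASSES - 1)
loss_with_logits = cross_entropy_from_logits(label, confident_logits)

# Double softmax: a perfect probability vector is treated as logits.
perfect_probs = [1.0] + [0.0] * (NUM_CLASSES - 1)
loss_floor = cross_entropy_from_logits(label, perfect_probs)

print("loss from logits:          %.8f" % loss_with_logits)
print("best loss with 2x softmax: %.4f" % loss_floor)
```

A floor of roughly 0.9 would be in line with the converted model's loss hovering between about 1.05 and 1.44 in the log above, while the Keras model reaches values near 0.002.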

Tensorflow: global_step value does not change (always 0)

I created the global_step variable with get_or_create_global_step and later access it with get_global_step, but during training the value I read does not change at all. The relevant code is below:
with tf.device(self.device):
    ...
    loss = model.loss(....)
    global_step = tf.train.get_or_create_global_step()
    update = model.update(loss, global_step, self.lrate, self.grad_clip)
    init_op = tf.global_variables_initializer()
    ...
with tf.train.MonitoredTrainingSession(master=self.server.target, is_chief=(self.task_index == 0), config=config, hooks=hooks) as mon_sess:
    mon_sess.run(init_op)
    _, lossVal, step = mon_sess.run([update, loss, tf.train.get_global_step()])
    print('step %d, train loss = %f' % (step, lossVal))
    ...
Here model is an instance of the Model class, whose update function is implemented as follows:
class Model(object):
    def __init__(self):
        ...
    def update(self, loss, global_step, learning_rate, grad_clip):
        optimizer = tf.train.AdamOptimizer(learning_rate)
        grads_and_vars = optimizer.compute_gradients(loss=loss)
        ...
        update_op = optimizer.apply_gradients(grads_and_vars=clipped_grads_and_vars, global_step=global_step, name='apply_gradients')
        return update_op
The code above prints:
step 0, train loss = ...
step 0, train loss = ...
step 0, train loss = ...
Can anyone help with this?
Thanks!