TensorFlow / AI Cloud Platform: HyperTune trials failed to report the hyperparameter tuning metric

I'm using the tf.estimator API with TensorFlow 2.1 on Google AI Platform to build a DNN Regressor. To use AI Platform Training hyperparameter tuning, I followed Google's docs.
I used the following configuration parameters:
config.yaml:
trainingInput:
  scaleTier: BASIC
  hyperparameters:
    goal: MINIMIZE
    maxTrials: 2
    maxParallelTrials: 2
    hyperparameterMetricTag: rmse
    enableTrialEarlyStopping: True
    params:
    - parameterName: batch_size
      type: DISCRETE
      discreteValues:
      - 100
      - 200
      - 300
    - parameterName: lr
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE
And to add the metric to my summary, I used the following code for my DNNRegressor:
def rmse(labels, predictions):
    pred_values = predictions['predictions']
    rmse = tf.keras.metrics.RootMeanSquaredError(name='root_mean_squared_error')
    rmse.update_state(labels, pred_values)
    return {'rmse': rmse}

def train_and_evaluate(hparams):
    ...
    estimator = tf.estimator.DNNRegressor(
        model_dir=output_dir,
        feature_columns=get_cols(),
        hidden_units=[max(2, int(FIRST_LAYER_SIZE * SCALE_FACTOR ** i))
                      for i in range(NUM_LAYERS)],
        optimizer=tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE),
        config=run_config)
    estimator = tf.estimator.add_metrics(estimator, rmse)
According to Google's documentation, the add_metrics function creates a new estimator with the specified metric, which can then be used as the hyperparameter tuning metric. However, the AI Platform Training service doesn't recognise this metric:
[Screenshot: job details on AI Platform]
When I run the code locally, the rmse metric does get written to the logs.
So, how do I make the metric available to the Training job on AI Platform using Estimators?
Additionally, there is the option of reporting metrics through the cloudml-hypertune Python package, but it requires the value of the metric as one of its input arguments. How do I extract the metric from tf.estimator.train_and_evaluate (the function I use to train and evaluate my estimator) so that I can pass it to report_hyperparameter_tuning_metric?
hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=??,
    global_step=1000
)
ETA: The logs show no error; the job is reported as completed successfully even though the trials fail to report the metric.
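One possibility I'm considering (a sketch only, not verified; eval_input_fn, train_spec, and eval_spec stand for the objects already built inside my train_and_evaluate function) is to call estimator.evaluate() once more after training and pass its result into the hypertune call, since evaluate() returns a dict containing the 'rmse' key added via add_metrics:

import hypertune

# Train and evaluate as before.
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)

# evaluate() returns a dict of metrics, including the 'rmse' key produced by
# the rmse() metric function added with tf.estimator.add_metrics above.
metrics = estimator.evaluate(input_fn=eval_input_fn)

hpt = hypertune.HyperTune()
hpt.report_hyperparameter_tuning_metric(
    hyperparameter_metric_tag='rmse',
    metric_value=metrics['rmse'],
    global_step=metrics['global_step'])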

Related

Issue with TensorBoard and nothing logging

I was following Google's TensorBoard tutorial with HParams here. However, when I try to implement it in my own model, nothing shows up in the logs. The main difference is that I use an ImageDataGenerator, but I do not see how that would affect the hyperparameters. I have included all the code used to set up the hyperparameters, but removed the model and the basic package imports for brevity.
# Load the TensorBoard notebook extension
%load_ext tensorboard

# HParams dashboard API (imported as in the tutorial)
from tensorboard.plugins.hparams import api as hp

# Clear all logs
!rm -rf ./logs/
Here is what I have set up for the hyperparameters. Just learning rate and weight decay. Slightly augmented from the tutorial, but largely very much the same style.
HP_lr = hp.HParam('learning_rate', hp.Discrete([3, 4, 5]))
HP_weight_decay = hp.HParam('l2_weight_decay', hp.Discrete([4, 5, 6]))
METRIC_ACCURACY = 'accuracy'
This is a little different to account for the values above, but those are simply variable names.
# file writer
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_lr, HP_weight_decay],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )
I have a function that builds the model taking an hparams argument. Besides using datagen.flow() in the model.fit, nothing changes.
def train_test_model(hparams):
    model = build_model(hparams)
    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, batch_size=64, verbose=1)
    return accuracy
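build_model itself is not shown above; a minimal hypothetical version that consumes the two hyperparameters might look like the following (the architecture, input shape, and loss here are placeholders, not my actual model):

def build_model(hparams):
    # Hypothetical stand-in for the real model; only the way the two
    # hyperparameters are consumed matters here.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(
            128, activation='relu',
            kernel_regularizer=tf.keras.regularizers.l2(hparams[HP_weight_decay])),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=hparams[HP_lr]),
        loss='sparse_categorical_crossentropy',
        metrics=['accuracy'])
    return model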
# For each run, log the metrics and hyperparameters used
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
This sets up the dictionary of hyperparameter values that is passed to hp for each run:
session_num = 0
for learn_rate in HP_lr.domain.values:
    for wd in HP_weight_decay.domain.values:
        hparams = {
            HP_lr: 1 * 10 ** (-learn_rate),  # transform to something like 1e-3
            HP_weight_decay: 1 * 10 ** (-wd)
        }
        run_name = "run-%d" % session_num
        print('--- Starting trial: %s' % run_name)
        print({h.name: hparams[h] for h in hparams})
        run('logs/hparam_tuning/' + run_name, hparams)
        session_num += 1
%tensorboard --logdir logs/hparam_tuning

Is there a way to get the current learning rate from an Estimator?

I would like to keep track of the learning rate while training with an Estimator (a TPUEstimator on a TPU, of all things). I am experimenting with the Colab MNIST example. I figured I would create a training hook, which would log the learning rate. Here is my code:
class TrainHook(tf.train.SessionRunHook):
    def __init__(self, optimizer):
        self.optimizer = optimizer

    def after_create_session(self, session, coord):
        self.session = session

    def before_run(self, run_context):
        # optimizer is a CrossShardOptimizer (see notebook), hence ._opt
        logger.info('Learning rate is {}'.format(
            self.optimizer._opt._lr.eval(session=self.session)))
optimizer was created like this in model_fn (copied from the notebook):
step = tf.train.get_or_create_global_step()
lr = 0.0001 + tf.train.exponential_decay(0.01, step, 2000 // 8, 1 / math.e)
optimizer = tf.train.AdamOptimizer(lr)
if params['use_tpu']:
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
Unfortunately, when I run this code, I get the following error: "Error recorded from training_loop: Operation 'add_1' has been marked as not fetchable." Apparently _opt._lr is an add_1 operation because of exponential_decay, at least on the TPU; when I run regular TensorFlow on my laptop, it is an add operation. Could this be the difference?
I know that even if I could get it, _lr is not the current, but the base learning rate. But it's a start. :)
Note: _lr and _lr_t behave identically in that both result in the error above.
I use TensorFlow v1.14.
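A common alternative with plain Estimators is to fetch the learning-rate tensor through the hook's run arguments instead of calling .eval() on a private attribute. Something along these lines (untested here; it may still hit the same fetchability restriction inside the TPU training loop, and lr_tensor is the decayed-lr tensor built in model_fn above):

class LRLogger(tf.train.SessionRunHook):
    """Logs a learning-rate tensor on every step via the hook's fetches."""

    def __init__(self, lr_tensor):
        # lr_tensor is the decayed learning-rate tensor (the `lr` above).
        self.lr_tensor = lr_tensor

    def before_run(self, run_context):
        # Ask the session to fetch the tensor alongside the training ops.
        return tf.train.SessionRunArgs(self.lr_tensor)

    def after_run(self, run_context, run_values):
        logger.info('Learning rate is {}'.format(run_values.results))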

How to check evaluation AUC after every epoch when using tf.estimator.EstimatorSpec?

I defined my model using tf.estimator.EstimatorSpec. I know it has train, evaluation, and prediction modes, but I want to check metric scores such as AUC after every epoch. Does this API support that, like Keras does?
There is no direct API for adding metrics such as AUC, but you can create a custom metric function using tf.keras.metrics and attach it to the Estimator with tf.estimator.add_metrics.
Example code demonstrating the AUC implementation is shown below:
def my_auc(labels, predictions):
    auc_metric = tf.keras.metrics.AUC(name="my_auc")
    auc_metric.update_state(y_true=labels, y_pred=predictions['logistic'])
    return {'auc': auc_metric}

estimator = tf.estimator.DNNClassifier(...)
estimator = tf.estimator.add_metrics(estimator, my_auc)
estimator.train(...)
estimator.evaluate(...)
Or, with sample weights taken from a feature:
def my_auc(labels, predictions, features):
    auc_metric = tf.keras.metrics.AUC(name="my_auc")
    auc_metric.update_state(y_true=labels, y_pred=predictions['logistic'],
                            sample_weight=features['weight'])
    return {'auc': auc_metric}

estimator = tf.estimator.DNNClassifier(...)
estimator = tf.estimator.add_metrics(estimator, my_auc)
estimator.train(...)
estimator.evaluate(...)
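If you want the AUC after every epoch rather than only at the end, a minimal sketch is to alternate train() and evaluate() in a loop; evaluate() returns a dict containing the 'auc' key defined above (train_input_fn, eval_input_fn, steps_per_epoch, and num_epochs are placeholder names, and this assumes you know how many steps make up an epoch):

for epoch in range(num_epochs):
    estimator.train(input_fn=train_input_fn, steps=steps_per_epoch)
    # evaluate() returns a dict of metrics, including 'auc' from my_auc above.
    metrics = estimator.evaluate(input_fn=eval_input_fn)
    print('epoch %d: auc = %.4f' % (epoch + 1, metrics['auc']))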

tf.Estimator.predict() issue when using a TensorFlow Hub module as the basis of a custom tf.Estimator

I am trying to create a custom TensorFlow tf.Estimator. In the model_fn passed to the tf.Estimator, I import the Inception_V3 module from TensorFlow Hub.
Problem: after fine-tuning the model (using tf.Estimator.train), the results obtained with tf.Estimator.predict are not as good as expected based on tf.Estimator.evaluate. (This is a regression problem.)
I am new to TensorFlow and TensorFlow Hub, so I could be making lots of rookie mistakes.
When I run tf.Estimator.evaluate() on my validation data, the reported loss is in the same ball park as the loss after tf.Estimator.train() was used to train the model. The problem comes in when I try to use tf.Estimator.predict() on the same validation data.
tf.Estimator.predict() returns predictions which I then use to calculate the same loss metric (mean_squared_error) which is computed by tf.Estimator.evaluate(). I am using the same set of data to feed to the predict function as the evaluate function. But I do not get the same result for the mean_squared_error -- not remotely close! (The mse I calculate from predict is much worse.)
Here is what I have done (edited out some details)...
Define a model_fn with the TensorFlow Hub module, then call the tf.Estimator functions to train, evaluate, and predict.
def my_model_fun(features, labels, mode, params):
    # Load the InceptionV3 module from TensorFlow Hub
    iv3_module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
                            trainable=True, tags={'train'})

    # Gather the variables for fine-tuning
    var_list = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='CustomeLayer')
    var_list.extend(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='module/InceptionV3/Mixed_5b'))

    predictions = {"the_prediction": final_output}

    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(mode=mode, predictions=predictions)

    # Define loss, optimizer, and evaluation metrics
    loss = tf.losses.mean_squared_error(labels=labels, predictions=final_output)
    optimizer = tf.train.AdadeltaOptimizer(learning_rate=learn_rate).minimize(
        loss, var_list=var_list, global_step=tf.train.get_global_step())
    rms_error = tf.metrics.root_mean_squared_error(labels=labels, predictions=predictions["the_prediction"])
    eval_metric_ops = {"rms_error": rms_error}

    if mode == tf.estimator.ModeKeys.TRAIN:
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, train_op=optimizer)

    if mode == tf.estimator.ModeKeys.EVAL:
        tf.summary.scalar('rms_error', rms_error)
        return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)
iv3_estimator = tf.estimator.Estimator(model_fn=iv3_model_fn)
iv3_estimator.train(input_fn=train_input_fn, steps=TRAIN_STEPS)
iv3_estimator.evaluate(input_fn=val_input_fn)

ii = 0
for ans in iv3_estimator.predict(input_fn=test_input_fn):
    sqErr = np.square(label[ii] - ans['the_prediction'][0])
    totalSqErr += sqErr
    ii += 1
mse = totalSqErr / ii
I expect that the mse loss reported by tf.Estimator.evaluate() should be the same as the mse I calculate from the known labels and the output of tf.Estimator.predict().
Do I need to import the TensorFlow Hub model differently when I use predict (e.g. use trainable=False in the call to hub.Module())?
Are the weights obtained from training being used when tf.Estimator.evaluate() runs, but not when tf.Estimator.predict() runs?
Something else?
There are a few things that seem to be missing from the code snippet. How is final_output computed from iv3_module? Also, mean squared error is an unusual choice of loss function for a classification problem; the common approach is to pass the image features from the module into a linear output layer with scores for each class ("logits") and a softmax cross-entropy loss. For an explanation of these terms, you can review online tutorials like https://developers.google.com/machine-learning/crash-course/ (all the way to multi-class neural nets).
Regarding TF-Hub technicalities:
The variables of a Hub module are automatically added to the GLOBAL_VARIABLES and TRAINABLE_VARIABLES collections (if trainable=True, as you already do). No manual extension of those collections should be needed.
hub.Module(..., tags=...) should be set to {"train"} for mode==TRAIN and set to None or the empty set otherwise.
In general, it's useful to get a solution working end-to-end for your problem without fine-tuning as a baseline, and then add fine-tuning.
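To illustrate the point about tags above, here is a sketch of making them mode-dependent inside model_fn; it mirrors the hub.Module call from the question, and the surrounding names are taken from that snippet rather than verified code:

# Inside my_model_fun(features, labels, mode, params):
is_training = (mode == tf.estimator.ModeKeys.TRAIN)
iv3_module = hub.Module(
    "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
    trainable=True,
    tags={"train"} if is_training else None)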

tf.contrib.learn API trains faster?

I am currently experimenting with the TensorFlow APIs and need help with retrain.py for retraining Inception.
I am trying out the new tf.contrib.learn APIs and would like to change retrain.py to use the new high-level APIs.
However, I am currently facing issues with:
1. porting over the TensorBoard logging features from the original script
2. defining the input_fn to return data in minibatches (see the sketch below)
I tried finding examples of this online but couldn't find any.
Has anyone tried doing this before, and how did you solve the problems mentioned above?
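On point 2 (the minibatch input_fn), here is a minimal sketch using tf.data; the images and labels arrays and the batch size of 32 are placeholders for whatever data pipeline retrain.py actually uses:

def my_input_fn():
    # images and labels are hypothetical in-memory numpy arrays; swap in
    # your own data pipeline (e.g. reading the cached bottleneck files).
    dataset = tf.data.Dataset.from_tensor_slices(({'image': images}, labels))
    dataset = dataset.shuffle(buffer_size=1000).repeat().batch(32)
    return dataset.make_one_shot_iterator().get_next()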
In addition, I would like to know whether there are any differences between these two ways of computing the accuracy metric. I'm asking because, after porting the model to tf.contrib.learn, I got 96% accuracy on the flower_photos sample dataset, a significant improvement over the original 91%.
Method 1: Using eval_metric_ops in the model_fn
# Calculate accuracy as an additional eval metric
eval_metric_ops = {
    "accuracy": tf.metrics.accuracy(targets, one_hot_classes)
}
Method 2: Calculating it manually in the original retrain.py
with tf.name_scope('accuracy'):
    with tf.name_scope('correct_prediction'):
        prediction = tf.argmax(result_tensor, 1)
        correct_prediction = tf.equal(
            prediction, tf.argmax(ground_truth_tensor, 1))
    with tf.name_scope('accuracy'):
        evaluation_step = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
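For reference, the two methods should agree on a single batch: tf.metrics.accuracy is a streaming metric that accumulates counts across all evaluation batches via its update op, while the tf.reduce_mean version computes accuracy for one batch only. A toy TF 1.x illustration (the tensors here are made-up values, not from retrain.py):

import tensorflow as tf  # TF 1.x, graph mode

targets = tf.constant([1, 0, 1, 1])
predicted = tf.constant([1, 0, 0, 1])

# Method 1: streaming metric; returns (value, update_op) and keeps running counts.
acc_value, acc_update = tf.metrics.accuracy(labels=targets, predictions=predicted)

# Method 2: per-batch mean, as in the original retrain.py snippet.
batch_acc = tf.reduce_mean(tf.cast(tf.equal(predicted, targets), tf.float32))

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # metric counters are local variables
    print(sess.run([acc_update, batch_acc]))    # both evaluate to 0.75 for this batch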