Tensorflow Estimator - Periodic Evaluation on Eval Dataset - tensorflow

The tensorflow documentation does not provide any example of how to perform a periodic evaluation of the model on an evaluation set.
Some people suggested the use of an Experiment, which sounds great but unfortunately does not work (depreciation and triggers an error).
Others suggested the use of SummarySaverHook, but I don't see how you can use that with an evaluation set (as opposed to the training set).
A solution would be to do the following
for i in range(number_of_epoch):
estimator.train(...) // on training set
estimator.evaluate(...) // on evaluation set
This architecture is explicitly discouraged in this paper (page 4 top right).
Any other idea/implementation?
EDIT:
The error message when running the experiment is the following:
File ".../anaconda2/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 253, in train if (config.environment != run_config.Environment.LOCAL and
AttributeError: 'RunConfig' object has no attribute 'environment'
Tensorflow version 1.3

Only a few parameters/options of Experiment are deprecated (what specific errors are you seeing?). If you create an Estimator that will do periodic checkpoints (using options in RunConfig) and an Experiment using it, you will get evaluation for each checkpoint by default when using train_and_evaluate method.
EDIT: As Maxime pointed out in the comments. He needed to add the following lines to get rid of his error:
os.environ['TF_CONFIG'] = json.dumps({'environment': 'local'})
config = tf.contrib.learn.RunConfig()

Related

Reload Keras-Tuner Trials from the directory

I'm trying to reload or access the Keras-Tuner Trials after the Tuner's search has completed for inspecting the results. I'm not able to find any documentation or answers related to this issue.
For example, I set up BayesianOptimization to search for the best hyper-parameters as follows:
## Build Hyper Parameter Search
tuner = kt.BayesianOptimization(build_model,
objective='val_categorical_accuracy',
max_trials=10,
directory='kt_dir',
project_name='lstm_dense_bo')
tuner.search((X_train_seq, X_train_num), y_train_cat,
epochs=30,
batch_size=64,
validation_data=((X_val_seq, X_val_num), y_val_cat),
callbacks=[callbacks.EarlyStopping(monitor='val_loss', patience=3,
restore_best_weights=True)])
I see this creates trial files in the directory kt_dir with project name lstm_dense_bo such as below:
Now, if I restart my Jupyter kernel, how can I reload these trials into a Tuner object and subsequently inspect the best model or the best hyperparameters or the best trial?
I'd very much appreciate your help. Thank you
I was trying to do the same thing. I was looking into the keras docs for an easier way than this but could not find one - so if any other SO-ers have a better idea, please let us know!
Load the previous tuner. Make sure overwrite=False or else you'll delete your trials.
workdir = "mlp_202202151345"
obj = "val_recall"
tuner = kt.Hyperband(
hypermodel=build_model,
metrics=metrics,
objective=kt.Objective(obj, direction="max"),
executions_per_trial=1,
overwrite=False,
directory=workdir,
project_name="keras_tuner",
)
Look for a trial you want to load. Note that TensorBoard works really well for this. In this example, I'm loading 1a38ebaba07b77501999cb1c4ab9413e.
Here's the part that I could not find in Keras docs. This might be dependent on the tuner you use (I am using Hyperband):
tuner.oracle.get_trial('1a38ebaba07b77501999cb1c4ab9413e')
Returns a Trial object (also could not find in the docs). The Trial object has a hyperparameters attribute that will return that trial's hyperparameters. Now:
tuner.hypermodel.build(trial.hyperparameters)
Gives you the trial's model for training, evaluation, predictions, etc.
NOTE This seems convuluted and hacky, would love to see a better way.
j7skov has correctly mentioned that you need to reload previous tuner and set the parameter overwrite=False(so that tuner will not overwrite already generated trials).
Further if you want to load first K best models then we need to use tuner's get_best_models method as below
# This will load 10 best hyper tuned models with the weights
# corresponding to their best checkpoint (at the end of the best epoch of best trial).
best_model_count = 10
bo_tuner_best_models = tuner.get_best_models(num_models=best_model_count)
Then you can access a specific best model as below
best_model_id = 7
model = bo_tuner_best_models[best_model_id]
This method is for querying the models trained during the search. For best performance, it is recommended to retrain your Model on the full dataset using the best hyperparameters found during search, which can be obtained using tuner.get_best_hyperparameters().
tuner_best_hyperparameters = tuner.get_best_hyperparameters(num_trials=best_model_count)
best_hp = tuner_best_hyperparameters[best_model_id]
model = tuner.hypermodel.build(best_hp)
If you want to just display hyperparameters for the K best models then use tuner's results_summary method as below
tuner.results_summary(num_trials=best_model_count)
For further reference visit this page.
Inspired by j7skov, I found that the models can be reloaded
by manipulating tuner.oracle.trials and tuner.load_model.
By assigning tuner.oracle.trials to a variable, we can find that it is a dict object containing all relavant trials in the tuning process.
The keys of the dictionary are the trial_id, and the values of the
dictionary are the instance of the Trial object.
Alternatively, we can return the best few trials by using tuner.oracle.get_best_trials.
To inspect the hyperparameters of the trial, we can use the summary method of the instance.
To load the model, we can pass the trial instance to tuner.load_model.
Beware that different versions can lead to incompatibilities.
For example the directory structure is a little different between keras-tuner==1.0 and keras-tuner==1.1 as far as I know.
Using your example, the working flow may be summarized as follows.
# Recreate the tuner object
tuner = kt.BayesianOptimization(build_model,
objective='val_categorical_accuracy',
max_trials=10,
directory='kt_dir',
project_name='lstm_dense_bo',
overwrite=False)
# Return all trials from the oracle
trials = tuner.oracle.trials
# Print out the ID and the score of all trials
for trial_id, trial in trials.items():
print(trial_id, trial.score)
# Return best 5 trials
best_trials = tuner.oracle.get_best_trials(num_trials=5)
for trial in best_trials:
trial.summary()
model = tuner.load_model(trial)
# Do some stuff to the model
using
tuner = kt.BayesianOptimization(build_model,
objective='val_categorical_accuracy',
max_trials=10,
directory='kt_dir',
project_name='lstm_dense_bo')
will load the tuner again.

GluonCV inference with finetuned model - “Please make sure source and target networks have the same prefix” error

I used GluonCV to finetune an object detection model in order to recognize some custom classes, mostly following the related tutorial.
I tried using both “ssd_512_resnet50_v1_coco” and “ssd_512_mobilenet1.0_coco” as base models, and the training process ended successfully (the accuracy on the validation dataset is reasonably high).
The problem is, I tried running inference with the newly trained model, by using for example:
classes = ["CML_mug", "person"]
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_custom',
classes=classes,
pretrained_base=False,
ctx=ctx)
net.load_params("saved_weights/-0070.params", ctx=ctx)
but I get the error:
AssertionError: Parameter 'mobilenet0_conv0_weight' is missing in file: saved_weights/CML_mobilenet_00/-0000.params, which contains parameters: 'ssd0_ssd0_mobilenet0_conv0_weight', 'ssd0_ssd0_mobilenet0_batchnorm0_gamma', 'ssd0_ssd0_mobilenet0_batchnorm0_beta', ..., 'ssd0_ssd0_ssdanchorgenerator2_anchor_2', 'ssd0_ssd0_ssdanchorgenerator3_anchor_3', 'ssd0_ssd0_ssdanchorgenerator4_anchor_4', 'ssd0_ssd0_ssdanchorgenerator5_anchor_5'. Please make sure source and target networks have the same prefix.
So, it seems the network parameters are named differently in the .params file and in the model I’m using for inference. Specifically, in the .params file, the name of the network weights is prefixed by the string “ssd0_ssd0_”, which lead to the error when invoking net.load_parameters.
I did this whole procedure a few times in the past without having problems, did anything change? I’m running it on Ubuntu 18.04, with mxnet-mkl (1.6.0) and gluoncv (0.7.0).
I tried loading the .params file by:
from mxnet import nd
model = nd.load(0070.param)
and I wanted to modify it and remove the “ssd0_ssd0_” string that is causing the problem.
I’m trying to navigate the dictionary, but between the keys I only found a:
ssd0_resnetv10_conv0_weight
so, slightly different than indicated in the error.
Anyway, this way of fixing the issue would be a little cumbersome, I’d prefer a more direct way.
Ok, fixed it. Basically, during training I was saving the .params file by using:
net.export(param_file)
and, as I said, loading them during inference by:
net.load_parameters(param_file)
However, it doesn’t work this way, but it does if instead of export I use:
net.save_parameters(param_file)

Turning off tensorflow autograph

I'm using tf.data.Dataset.map(process_fn) instruction,
the mapping function is composed purely tensorflow graph functions, still it seems that Autograph is trying to transform them. How can I prevent it?
How can I force tensorflow to use my pice of code (that defines graph) as it is?
def process_fn(item):
assert 'image' in item
# this should be executed right not every time graph is executed
image = tf.image.convert_image_dtype(item.pop('image'), tf.float32)
image = tf.multiply(tf.subtract(image, 0.5), 2)
return image
For some reason tensorflow wants to transform this function and reports a warning its impossible and that it will be used as it is.
The question is, why there is even an attempt to use Autograph in the first place?
W0119 14:55:15.113813 140297917577024 ag_logging.py:146] Entity
<function geospatial_input.<locals>.process_fn at 0x7f991b5fe280> could
not be transformed and will be executed as-is. Please report this to
the AutoGraph team. When filing the bug, set the verbosity to 10 (on
Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
For v2.2 (and back to v1.15), you can use tf.autograph.experimental.do_not_convert:
#tf.autograph.experimental.do_not_convert
def process_fn(item):
...

Predict value of single image after training model on TPU

I still want to know how I can predict the value of an image after training the network, but it seems like it is not supported yet. Any idea for a workaround (taken from the mnist_tpu.py)?
if mode == tf.estimator.ModeKeys.PREDICT:
raise RuntimeError("mode {} is not supported yet".format(mode))
Besides Stackoverflow - anywhere else I can get support for the implementing my models using TPU?
Here is a Python program that sends an image to a TPU-trained model (ResNet in this case) and gets back a classification:
with tf.gfile.FastGFile('/some/path.jpg', 'r') as ifp:
credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')
request_data = {'instances':
[
{"image_bytes": {"b64": base64.b64encode(ifp.read())}}
]
}
parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, MODEL, VERSION)
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))
Full code is here: https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/quests/tpu/flowers_resnet.ipynb
This article documents the process of writing a model for the Cloud TPU: https://medium.com/tensorflow/how-to-write-a-custom-estimator-model-for-the-cloud-tpu-7d8bd9068c26
It is supported now. Changes have been done to https://github.com/tensorflow/models/blob/master/official/mnist/mnist_tpu.py to make it working.
Besides stackoverflow, you can add your issues on github https://github.com/tensorflow/tpu/issues.
According to the documentation, you can choose online or batch modes for prediction, but you can't select the target device. As stated, "the prediction service allocates resources to run your job."
The documentation says that prediction is performed by nodes. I thought I'd read somewhere that prediction nodes are always CPUs in the Google Compute Engine, but I can't find a clear reference.

How to store best models checkpoints, not only newest 5, in Tensorflow Object Detection API?

I'm training MobileNet on WIDER FACE dataset and I encountered problem I couldn't solve. TF Object Detection API stores only last 5 checkpoints in train dir, but what I would like to do, is to save best models relative to mAP metric (or at least leave many more models in train dir before deletion).
For example, today I've looked at Tensorboard after next night of training and I see that overnight model has over-fitted and I can't restore best checkpoint, because it was deleted already.
EDIT: I just use Tensorflow Object Detection API, it by default saves last 5 checkpoints in train dir which I point. I look for some configuration parameter or anything that will change this behavior.
Has anyone have some fix in code/config param to set/workaround for that? It seems like I'm missing something, it should be obvious that what's in fact important is the best model, not the newest one (which can overfit).
Thanks!
You can modify (hardcoding in your fork or opening a pull request and adding the options to protos) the arguments passed to tf.train.Saver in:
https://github.com/tensorflow/models/blob/master/research/object_detection/legacy/trainer.py#L376-L377
You will probably want to set:
max_to_keep: Maximum number of recent checkpoints to keep. Defaults to 5.
keep_checkpoint_every_n_hours: How often to keep checkpoints. Defaults to 10,000 hours.
You can change config.
in run_config.py
class RunConfig(object):
"""This class specifies the configurations for an `Estimator` run."""
def __init__(self,
model_dir=None,
tf_random_seed=None,
save_summary_steps=100,
save_checkpoints_steps=_USE_DEFAULT,
save_checkpoints_secs=_USE_DEFAULT,
session_config=None,
keep_checkpoint_max=10,
keep_checkpoint_every_n_hours=10000,
log_step_count_steps=100,
train_distribute=None,
device_fn=None,
protocol=None,
eval_distribute=None,
experimental_distribute=None):
You may be interested by this Tf github thread that tackles the newest/best checkpoint issue. A user developed his own wrapper, chekmate, around tf.Saver to keep track of the best checkpoints.
You can follow up this PR. Here your best checkpoint is saved within your checkpoint directory, sub-directory named as best.
You just need to integrate the best_saver() and (method call in _run_checkpoint_once()) inside ../object_detection/eval_util.py
Additionally it will also create a json for all_evaluation_metrices.
For saving more checkpoints, you can write a simple python script that will store the checkpoints in a timely manner to a specific.
import os
import shutil
import time
while True:
training_file = '/home/vignesh/training' # path of your train directory
archive_file = 'home/vignesh/training/archive' #path of the directory where you want to save your checkpoints
files_to_save = []
for files in os.listdir(training_file):
if files.rsplit('.')[0]=='model':
files_to_save.append(files)
for files in files_to_save:
if files in os.listdir(archive_file):
pass
else:
shutil.copy2(training_file+'/'+files,archive_file)
time.sleep(600) # This will make the script run for every 600 seconds, modify it for your need