How to save the training model at each training step instead of a periodic save based on a time interval? - in TensorFlow-Slim - tensorflow

slim.learning.train(...) accepts two arguments pertaining to saving the model (save_interval_secs) or saving the summaries (save_summaries_secs). The problem with this API is that it only allows saving the model/summary based on a "time interval", but I need to do it after "each step" of training.
How can I achieve this using the TF-Slim API?
Here is the slim.learning.train API -
def train(train_op,
          logdir,
          train_step_fn=train_step,
          train_step_kwargs=_USE_DEFAULT,
          log_every_n_steps=1,
          graph=None,
          master='',
          is_chief=True,
          global_step=None,
          number_of_steps=None,
          init_op=_USE_DEFAULT,
          init_feed_dict=None,
          local_init_op=_USE_DEFAULT,
          init_fn=None,
          ready_op=_USE_DEFAULT,
          summary_op=_USE_DEFAULT,
          save_summaries_secs=600,
          summary_writer=_USE_DEFAULT,
          startup_delay_steps=0,
          saver=None,
          save_interval_secs=600,
          sync_optimizer=None,
          session_config=None,
          session_wrapper=None,
          trace_every_n_steps=None,
          ignore_live_threads=False):

Slim is deprecated; using Estimator you get full control over saving and summary frequency, per step rather than per time interval.
Alternatively, you can set the seconds-based intervals to a very small number so that it effectively saves on every step.
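As a rough illustration of the Estimator route, here is a minimal sketch assuming you already have a model_fn; the model_fn and model directory below are placeholders, not taken from the question:

import tensorflow as tf  # TF 1.x

# RunConfig controls checkpoint/summary frequency by step count
# instead of wall-clock time.
run_config = tf.estimator.RunConfig(
    model_dir='/tmp/model_dir',   # placeholder directory
    save_checkpoints_steps=1,     # write a checkpoint after every training step
    save_summary_steps=1,         # write summaries after every training step
    keep_checkpoint_max=None)     # None keeps all checkpoints

estimator = tf.estimator.Estimator(model_fn=model_fn, config=run_config)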

Related

How to save the best model instead of the last one for Detectron2

I want to save the best model instead of the last model for detectron2. The evaluation metric I want to use is AP50 or something similar. The code I currently have is:
trainer.register_hooks([
    EvalHook(eval_period=20, eval_function=lambda: {'AP50': function?}),
    BestCheckpointer(eval_period=20, checkpointer=trainer.checkpointer, val_metric="AP50", mode="max")
])
But I have no idea what I have to substitute for the function in EvalHook. I use a subset of the coco dataset to train the model, and I saw that detectron2 contains some evaluation measures for the coco dataset, but I have no idea how to implement this.
This notebook has an implementation of what you asked and what I am searching for...
trainer.resume_or_load(resume=False)
if cfg.TEST.AUG.ENABLED:
    trainer.register_hooks(
        # This block uses a hook to run evaluation periodically.
        # https://detectron2.readthedocs.io/en/latest/modules/engine.html#detectron2.engine.hooks.EvalHook
        [hooks.EvalHook(0, lambda: trainer.test_with_TTA(cfg, trainer.model))]
    )
trainer.train()
Try this...
I will report here if it works.
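For the missing eval_function, one common approach (a sketch only, not from the notebook above) is to run detectron2's COCOEvaluator inside the hook and return the AP50 it reports. The dataset name "my_val_dataset" is a placeholder, the metric is read from the "bbox" results (use "segm" for instance segmentation), and cfg and trainer come from the question's own setup:

from detectron2.data import build_detection_test_loader
from detectron2.engine import hooks
from detectron2.evaluation import COCOEvaluator, inference_on_dataset

def eval_ap50():
    # Run COCO-style evaluation on the validation set and return AP50
    # in the {"AP50": value} form that BestCheckpointer can track.
    evaluator = COCOEvaluator("my_val_dataset", output_dir=cfg.OUTPUT_DIR)
    val_loader = build_detection_test_loader(cfg, "my_val_dataset")
    results = inference_on_dataset(trainer.model, val_loader, evaluator)
    return {"AP50": results["bbox"]["AP50"]}

trainer.register_hooks([
    hooks.EvalHook(eval_period=20, eval_function=eval_ap50),
    hooks.BestCheckpointer(eval_period=20, checkpointer=trainer.checkpointer,
                           val_metric="AP50", mode="max"),
])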

Reload Keras-Tuner Trials from the directory

I'm trying to reload or access the Keras-Tuner Trials after the Tuner's search has completed for inspecting the results. I'm not able to find any documentation or answers related to this issue.
For example, I set up BayesianOptimization to search for the best hyper-parameters as follows:
## Build Hyper Parameter Search
tuner = kt.BayesianOptimization(build_model,
                                objective='val_categorical_accuracy',
                                max_trials=10,
                                directory='kt_dir',
                                project_name='lstm_dense_bo')
tuner.search((X_train_seq, X_train_num), y_train_cat,
             epochs=30,
             batch_size=64,
             validation_data=((X_val_seq, X_val_num), y_val_cat),
             callbacks=[callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                                restore_best_weights=True)])
I see this creates trial files in the directory kt_dir under the project name lstm_dense_bo.
Now, if I restart my Jupyter kernel, how can I reload these trials into a Tuner object and subsequently inspect the best model or the best hyperparameters or the best trial?
I'd very much appreciate your help. Thank you
I was trying to do the same thing. I was looking into the keras docs for an easier way than this but could not find one - so if any other SO-ers have a better idea, please let us know!
Load the previous tuner. Make sure overwrite=False or else you'll delete your trials.
workdir = "mlp_202202151345"
obj = "val_recall"
tuner = kt.Hyperband(
    hypermodel=build_model,
    metrics=metrics,
    objective=kt.Objective(obj, direction="max"),
    executions_per_trial=1,
    overwrite=False,
    directory=workdir,
    project_name="keras_tuner",
)
Look for a trial you want to load. Note that TensorBoard works really well for this. In this example, I'm loading 1a38ebaba07b77501999cb1c4ab9413e.
Here's the part that I could not find in Keras docs. This might be dependent on the tuner you use (I am using Hyperband):
trial = tuner.oracle.get_trial('1a38ebaba07b77501999cb1c4ab9413e')
This returns a Trial object (also not covered in the docs). The Trial object has a hyperparameters attribute that returns that trial's hyperparameters. Now:
tuner.hypermodel.build(trial.hyperparameters)
Gives you the trial's model for training, evaluation, predictions, etc.
NOTE: This seems convoluted and hacky; I would love to see a better way.
j7skov has correctly mentioned that you need to reload the previous tuner and set the parameter overwrite=False (so that the tuner will not overwrite the already generated trials).
Further, if you want to load the K best models, use the tuner's get_best_models method as below:
# This will load 10 best hyper tuned models with the weights
# corresponding to their best checkpoint (at the end of the best epoch of best trial).
best_model_count = 10
bo_tuner_best_models = tuner.get_best_models(num_models=best_model_count)
Then you can access a specific best model as below
best_model_id = 7
model = bo_tuner_best_models[best_model_id]
This method is for querying the models trained during the search. For best performance, it is recommended to retrain your model on the full dataset using the best hyperparameters found during the search, which can be obtained using tuner.get_best_hyperparameters().
tuner_best_hyperparameters = tuner.get_best_hyperparameters(num_trials=best_model_count)
best_hp = tuner_best_hyperparameters[best_model_id]
model = tuner.hypermodel.build(best_hp)
If you just want to display the hyperparameters for the K best models, use the tuner's results_summary method as below:
tuner.results_summary(num_trials=best_model_count)
For further reference visit this page.
Inspired by j7skov, I found that the models can be reloaded by manipulating tuner.oracle.trials and tuner.load_model.
By assigning tuner.oracle.trials to a variable, we can see that it is a dict containing all relevant trials in the tuning process.
The keys of the dictionary are the trial_ids, and the values of the dictionary are instances of the Trial object.
Alternatively, we can return the best few trials by using tuner.oracle.get_best_trials.
To inspect the hyperparameters of the trial, we can use the summary method of the instance.
To load the model, we can pass the trial instance to tuner.load_model.
Beware that different versions can lead to incompatibilities.
For example the directory structure is a little different between keras-tuner==1.0 and keras-tuner==1.1 as far as I know.
Using your example, the workflow may be summarized as follows.
# Recreate the tuner object
tuner = kt.BayesianOptimization(build_model,
                                objective='val_categorical_accuracy',
                                max_trials=10,
                                directory='kt_dir',
                                project_name='lstm_dense_bo',
                                overwrite=False)

# Return all trials from the oracle
trials = tuner.oracle.trials

# Print out the ID and the score of all trials
for trial_id, trial in trials.items():
    print(trial_id, trial.score)

# Return the best 5 trials
best_trials = tuner.oracle.get_best_trials(num_trials=5)

for trial in best_trials:
    trial.summary()
    model = tuner.load_model(trial)
    # Do some stuff to the model
Using
tuner = kt.BayesianOptimization(build_model,
                                objective='val_categorical_accuracy',
                                max_trials=10,
                                directory='kt_dir',
                                project_name='lstm_dense_bo')
will load the tuner again.

How to add custom evaluation metrics in Tensorflow Object Detection API?

I would like to have my custom list of metrics when evaluating an instance segmentation model in Tensorflow's Object Detection API, which can be summarized as follows;
Precision values for IOUs of 0.5-0.95 with increments of 0.05
Recall values for IOUs of 0.5-0.95 with increments of 0.05
AUC values for precision and recall between 0-1 with increments of 0.05
What I've currently tested is modifying the already existing coco evaluation metrics by tweaking some code in the PythonAPI of pycocotools and the additional metrics file within Tensorflow's research model. Currently the default output values for COCO evaluation are the following
Precision/mAP
Precision/mAP@.50IOU
Precision/mAP@.75IOU
Precision/mAP (small)
Precision/mAP (medium)
Precision/mAP (large)
Recall/AR@1
Recall/AR@10
Recall/AR@100
Recall/AR@100 (small)
Recall/AR@100 (medium)
Recall/AR@100 (large)
So I decided first to use coco_detection_metrics in my eval_config field inside the .config file used for training
eval_config: {
  metrics_set: "coco_detection_metrics"
}
And edit cocoeval.py and coco_tools.py multiple times (proportional to the number of values) by adding more items to the stats list and the stats summary dictionary in order to get the desired result. For demonstration purposes, I am only going to show one example: adding precision at IOU=0.55 on top of precision at IOU=0.5.
So, this is the modified method of the COCOeval class inside cocoeval.py:
def _summarizeDets():
    stats[1] = _summarize(1, iouThr=.5, maxDets=self.params.maxDets[2])
    stats[12] = _summarize(1, iouThr=.55, maxDets=self.params.maxDets[2])
and the edited methods under the COCOEvalWrapper class inside coco_tools.py:
summary_metrics = OrderedDict([
    ('Precision/mAP@.50IOU', self.stats[1]),
    ('Precision/mAP@.55IOU', self.stats[12]),
    # ...
])
# ...
for category_index, category_id in enumerate(self.GetCategoryIdList()):
    per_category_ap['Precision mAP@.50IOU ByCategory/{}'.format(category)] = self.category_stats[1][category_index]
    per_category_ap['Precision mAP@.55IOU ByCategory/{}'.format(category)] = self.category_stats[12][category_index]
It would be useful to know a more efficient way to deal with my problem and easily request a list of custom evaluation metrics without having to tweak the already existing COCO files. Ideally, my primary goal is to
Be able to create a custom console output based on the metrics provided at the beginning of the question
and my secondary goals would be to
Export the metrics with their respective values in JSON format
Visualize the three graphs in Tensorboard
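For reference, a more direct way to get per-IoU precision values without editing _summarizeDets is to read COCOeval's accumulated arrays after evaluation. The sketch below only illustrates that idea outside the Object Detection API; the annotation and detection file names are placeholders:

import numpy as np
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder COCO-format ground-truth and detection files.
coco_gt = COCO("ground_truth.json")
coco_dt = coco_gt.loadRes("detections.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="segm")  # "bbox" for detection
coco_eval.evaluate()
coco_eval.accumulate()

# coco_eval.eval['precision'] has shape [T, R, K, A, M]:
# T = IoU thresholds (0.50:0.05:0.95), R = recall thresholds, K = categories,
# A = area ranges (0 = all), M = max detections (last entry = 100).
precision = coco_eval.eval["precision"]
for t, iou in enumerate(coco_eval.params.iouThrs):
    p = precision[t, :, :, 0, -1]
    ap = np.mean(p[p > -1]) if (p > -1).any() else float("nan")
    print("Precision/mAP@{:.2f}IOU: {:.3f}".format(iou, ap))

The resulting per-IoU values could then be dumped to JSON or written out as TensorBoard scalars, which would cover the secondary goals above.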

TensorFlow Supervisor just stores the latest five models

I am using TensorFlow's Supervisor to train my own model. I followed the official guide and set save_model_secs to 600. However, I find that the log_dir path strangely keeps only the latest five models and automatically discards models generated earlier. I carefully read the source code supervisor.py but cannot find the relevant removal code or the mechanism by which only five models are kept throughout the training process. Does anyone have a hint to help me? Any help is really appreciated.
tf.train.Supervisor has a saver argument. If not given, it will use a default. This is configured to only store the last five checkpoints. You can overwrite this by passing your own tf.train.Saver object.
See here for the docs. There are essentially two ways of storing more checkpoints when creating the Saver:
Pass some large integer to the max_to_keep argument. If you have enough storage, passing 0 or None should result in all checkpoints being kept.
Saver also has an argument keep_checkpoint_every_n_hours. This gives you a separate "stream" of checkpoints that is kept indefinitely. So, for example, you could store checkpoints every 600 seconds (via the save_model_secs argument to Supervisor) and keep only the five most recent of those, while additionally retaining one checkpoint every, say, 30 minutes (0.5 hours), all of which are kept permanently. A sketch combining both options follows.
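Here is a minimal sketch of passing your own Saver to Supervisor, assuming a TF 1.x graph; the logdir path and train_op are placeholders:

import tensorflow as tf  # TF 1.x API

# Keep every periodic checkpoint (max_to_keep=None) and additionally retain
# one checkpoint every half hour indefinitely.
saver = tf.train.Saver(max_to_keep=None, keep_checkpoint_every_n_hours=0.5)

sv = tf.train.Supervisor(logdir='/tmp/train_logs',  # placeholder logdir
                         saver=saver,
                         save_model_secs=600)
with sv.managed_session() as sess:
    while not sv.should_stop():
        sess.run(train_op)  # your training op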

Tensorflow--how to limit epochs with evaluation only?

Given that I train a model, save it off with a metagraph/Saver, and then load that graph into a new script/process to test against test data, what is the best way to make sure I only iterate over the test data once?
With my training data, I want to be able to iterate over the entire data set for an arbitrary number of iterations. I use
tf.train.string_input_producer()
to drive a queue of input files for training, so I can safely leave num_epochs at its default (None) and let other controls drive training termination.
However, when I run the graph for evaluation, I just want to evaluate the test set once (and gather the appropriate statistics).
Initial attempted solution:
Make a tensor for the number of epochs, pass it into tf.train.string_input_producer, and then tf.assign it to the appropriate value based on test/train.
But:
tf.train.string_input_producer only takes integers as num_epochs, so this isn't possible...unless I'm missing something.
Further notes: I use
tf.train.batch()
to read in test/train data that has been serialized into protocol buffers (https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html#file-formats), so I have minimal visibility into how the data is loaded and how far along it is.
tf.train.batch apparently will throw tf.errors.OutOfRangeError, but I'm not clear how to catch that successfully, or if that is even what I really want to do. I tried a very naive
try...except...finally
(like in https://www.tensorflow.org/versions/r0.11/how_tos/reading_data/index.html#creating-threads-to-prefetch-using-queuerunner-objects), which didn't catch the error from tf.train.batch.
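A rough sketch of the standard TF 1.x queue-runner pattern for a single evaluation pass, assuming the test data lives in a TFRecord file (the file name, batch size, and the ops run inside the loop are placeholders):

import tensorflow as tf  # TF 1.x queue-based input pipeline

# num_epochs=1 makes the queue raise OutOfRangeError after one pass over the test files.
test_files = ['test.tfrecords']  # placeholder path
filename_queue = tf.train.string_input_producer(test_files, num_epochs=1, shuffle=False)

reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
batch = tf.train.batch([serialized_example], batch_size=64,
                       allow_smaller_final_batch=True)

with tf.Session() as sess:
    # num_epochs creates a local variable, so local variables must be initialized too.
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run(batch)  # run the evaluation ops / accumulate statistics here
    except tf.errors.OutOfRangeError:
        pass  # the single epoch of test data has been consumed
    finally:
        coord.request_stop()
        coord.join(threads)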