tensorflow retrain model file - tensorflow

I'm getting started with TensorFlow and am using retrain.py to teach it some new categories. This works well, but I have some questions:
In the comments of retrain.py it says:
"This produces a new model file that can be loaded and run by any TensorFlow
program, for example the label_image sample code"
However, I haven't found where this new model file is saved to.
Also: it does contain the whole model, right? Not just the retrained part?
Thanks for clearing this up.

1) I think you may want to save the new model.
When you want to save a model after some processing, you can use
saver.save(sess, 'directory/model-name') (plus optional arguments such as global_step).
Check out https://www.tensorflow.org/api_docs/python/tf/train/Saver
If you change model-name by epoch or any other measure you would like to use, you can save each new model separately (otherwise, it may overwrite previously saved models).
You can find the saved model by looking for the 'checkpoint', '.index', and '.meta' files.
2) Saving the whole model or just part of it?
This is where you need to learn a bunch of ideas about tf.Session and savers. You can save either the whole model or just part of it; it's up to you. Again, start from the link above. The key point is that you put the variables you would like to save in a list, passed as 'var_list' in the link, and only those will be saved. When you load them back, you also need to specify which variables in your model correspond to the variables in the loaded file.
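For illustration, a minimal sketch of saving and restoring a checkpoint with tf.train.Saver (TF1-style API; the variable and the paths are placeholders, not from the question):

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

w = tf.Variable(tf.zeros([10]), name="w")
# Pass var_list=[w] instead to save only selected variables.
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Writes model-name-100.index, model-name-100.meta and a 'checkpoint' file.
    saver.save(sess, 'checkpoints/model-name', global_step=100)

# Later, restore into a freshly built graph of the same shape:
with tf.Session() as sess:
    saver.restore(sess, 'checkpoints/model-name-100')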

While running retrain.py you can pass the --output_graph and --output_labels parameters, which specify where to save the graph (default is /tmp/output_graph.pb) and the labels file. You can change those as per your requirements.
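For example (the image_dir path here is just a placeholder), a run that writes the retrained graph and labels to custom locations could look like:

python retrain.py \
    --image_dir ~/my_images \
    --output_graph /tmp/my_output_graph.pb \
    --output_labels /tmp/my_output_labels.txt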

Related

Problem when saving a machine learning keras model

I am following this Keras tutorial:
https://keras.io/examples/nlp/semantic_similarity_with_bert/
I wanted to save the model with this command:
model.save("saved_model/my_model")
I got some warnings when I saved the model:
(screenshot of the warnings, not included here)
Then, when I wanted to load the model to use it with this command:
tf.keras.models.load_model('saved_model/my_model')
I got this error:
(screenshot of the error, not included here)
Is this the right way to save the model?
Your first structure is inside a dict. You must extract the item from the dict to get rid of the error. Try checking this out.
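As a reference point, here is a minimal, self-contained sketch of the SavedModel round trip the question attempts; the toy model below only stands in for the BERT model from the tutorial, and models containing custom layers may additionally need custom_objects when loading:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Save in the SavedModel format (a directory, as in the question).
model.save("saved_model/my_model")

# Reload it; for models with custom layers, pass custom_objects={...} here.
reloaded = tf.keras.models.load_model("saved_model/my_model")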

Error: in the file data/coco.names number of names 80 that isn't equal to classes=13

I was using Google Colab to train YOLOv3 to detect custom objects. I'm new to Colab and darknet.
I used the following command for training:
!./darknet detector train "/content/gdrive/My Drive/darknet/obj.data" "/content/gdrive/My Drive/darknet/cfg/yolov3-PID.cfg" "/content/gdrive/My Drive/darknet/backup/yolov3-PID_final.weights" -dont_show
The training finished as follows, without displaying any details of the epochs (I don't know how many epochs actually ran). It actually took a very short time before it displayed Done! and saved the weights (the screenshot is not included here).
Then, I tried to detect a test image with the following command:
!./darknet detect "/content/gdrive/My Drive/darknet/cfg/yolov3-PID.cfg" "/content/gdrive/My Drive/darknet/backup/yolov3-PID_final.weights" "/content/gdrive/My Drive/darknet/img/MN 111-0-515 (45).jpg" -dont-show
However, I got the following error:
Error: in the file data/coco.names number of names 80 that isn't equal to classes=13 in the file /content/gdrive/My Drive/darknet/cfg/yolov3-PID.cfg
Also, the resulting image didn't contain any bounding boxes, so I don't know whether the training worked or not.
Could you please advise what might be wrong with the training, and why the error refers to coco.names while I'm using other files for the names and configuration?
You did not share the yolov3-PID.cfg, obj.data and coco.names files. I am assuming coco.names contains 80 classes, as in the repo.
The error is likely in obj.data, since it seems your goal is to detect 13 custom objects. If that is the case, set classes=13 and replace names=data/coco.names with names=data/obj.names. The obj.names file should contain 13 lines with the custom class names. Also modify yolov3-PID.cfg to use the same number of classes; a sketch of such an obj.data file is shown below.
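For instance, an obj.data for 13 custom classes could look like this (all paths are placeholders, not taken from the question):

classes = 13
train = data/train.txt
valid = data/test.txt
names = data/obj.names
backup = backup/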
I suggest using the repo below if you are not already using it. It contains Google Colab training and inference scripts for YOLOv3 and YOLOv4.
Here are the instructions for custom object detection training.
Nice work coming this far! Everything is fine, you just need to edit the data folder of darknet. By default it uses the COCO labels: go to the darknet folder --> find the data folder --> open the coco.names file --> edit the file by removing the 80 classes (in Colab just double click to edit and Ctrl+S to save) --> put down your desired classes and it's done!
I was having the same problem when I was training a custom model in Colab.
I just cloned darknet again in another folder, edited coco.names, and moved it to my training folder, and it worked!

GluonCV inference with finetuned model - “Please make sure source and target networks have the same prefix” error

I used GluonCV to finetune an object detection model in order to recognize some custom classes, mostly following the related tutorial.
I tried using both “ssd_512_resnet50_v1_coco” and “ssd_512_mobilenet1.0_coco” as base models, and the training process ended successfully (the accuracy on the validation dataset is reasonably high).
The problem is, I tried running inference with the newly trained model, by using for example:
classes = ["CML_mug", "person"]
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_custom',
                              classes=classes,
                              pretrained_base=False,
                              ctx=ctx)
net.load_params("saved_weights/-0070.params", ctx=ctx)
but I get the error:
AssertionError: Parameter 'mobilenet0_conv0_weight' is missing in file: saved_weights/CML_mobilenet_00/-0000.params, which contains parameters: 'ssd0_ssd0_mobilenet0_conv0_weight', 'ssd0_ssd0_mobilenet0_batchnorm0_gamma', 'ssd0_ssd0_mobilenet0_batchnorm0_beta', ..., 'ssd0_ssd0_ssdanchorgenerator2_anchor_2', 'ssd0_ssd0_ssdanchorgenerator3_anchor_3', 'ssd0_ssd0_ssdanchorgenerator4_anchor_4', 'ssd0_ssd0_ssdanchorgenerator5_anchor_5'. Please make sure source and target networks have the same prefix.
So, it seems the network parameters are named differently in the .params file and in the model I'm using for inference. Specifically, in the .params file, the names of the network weights are prefixed by the string "ssd0_ssd0_", which leads to the error when invoking net.load_parameters.
I did this whole procedure a few times in the past without having problems, did anything change? I’m running it on Ubuntu 18.04, with mxnet-mkl (1.6.0) and gluoncv (0.7.0).
I tried loading the .params file with:
from mxnet import nd
model = nd.load('0070.params')
and I wanted to modify it and remove the "ssd0_ssd0_" string that is causing the problem.
I'm trying to navigate the dictionary, but among the keys I only found:
ssd0_resnetv10_conv0_weight
so, slightly different from what is indicated in the error.
Anyway, this way of fixing the issue would be a little cumbersome; I'd prefer a more direct way.
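For reference, a sketch of that cumbersome workaround (the file names are placeholders): load the .params file as a dict of NDArrays, strip the prefix from each key, and save it back:

from mxnet import nd

params = nd.load('saved_weights/-0070.params')
fixed = {key.replace('ssd0_ssd0_', ''): value for key, value in params.items()}
nd.save('saved_weights/fixed-0070.params', fixed)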
Ok, fixed it. Basically, during training I was saving the .params file by using:
net.export(param_file)
and, as I said, loading them during inference by:
net.load_parameters(param_file)
However, it doesn't work this way; it does work if, instead of export, I use:
net.save_parameters(param_file)
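In other words, save_parameters/load_parameters form a matching pair, while export writes a symbol graph whose parameter names carry extra prefixes. A minimal sketch of the working pair (paths and classes are illustrative):

import mxnet as mx
import gluoncv as gcv

ctx = mx.cpu()
classes = ["CML_mug", "person"]
net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_custom',
                              classes=classes,
                              pretrained_base=False,
                              ctx=ctx)

# During training: save only the parameter values.
net.save_parameters("saved_weights/-0070.params")

# During inference: load them back with the matching call.
net.load_parameters("saved_weights/-0070.params", ctx=ctx)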

How to store best models checkpoints, not only newest 5, in Tensorflow Object Detection API?

I'm training MobileNet on the WIDER FACE dataset and I encountered a problem I couldn't solve. The TF Object Detection API stores only the last 5 checkpoints in the train dir, but what I would like to do is save the best models relative to the mAP metric (or at least keep many more models in the train dir before deletion).
For example, today I looked at TensorBoard after another night of training and I see that overnight the model has over-fitted, and I can't restore the best checkpoint because it was already deleted.
EDIT: I just use the Tensorflow Object Detection API; by default it saves the last 5 checkpoints in the train dir I point it to. I'm looking for some configuration parameter or anything else that will change this behavior.
Does anyone have a fix in code, a config param to set, or a workaround for that? It seems like I'm missing something; it should be obvious that what's in fact important is the best model, not the newest one (which can overfit).
Thanks!
You can modify the arguments passed to tf.train.Saver (by hardcoding them in your fork, or by opening a pull request and adding the options to the protos) in:
https://github.com/tensorflow/models/blob/master/research/object_detection/legacy/trainer.py#L376-L377
You will probably want to set:
max_to_keep: Maximum number of recent checkpoints to keep. Defaults to 5.
keep_checkpoint_every_n_hours: How often to keep checkpoints. Defaults to 10,000 hours.
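For example, a hedged sketch of what the modified Saver construction in trainer.py could look like (the surrounding code is not reproduced here, and the values are illustrative):

saver = tf.train.Saver(
    max_to_keep=50,                     # keep many more recent checkpoints
    keep_checkpoint_every_n_hours=2.0)  # and permanently keep one checkpoint every 2 hours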
You can change the config in run_config.py:
class RunConfig(object):
  """This class specifies the configurations for an `Estimator` run."""

  def __init__(self,
               model_dir=None,
               tf_random_seed=None,
               save_summary_steps=100,
               save_checkpoints_steps=_USE_DEFAULT,
               save_checkpoints_secs=_USE_DEFAULT,
               session_config=None,
               keep_checkpoint_max=10,
               keep_checkpoint_every_n_hours=10000,
               log_step_count_steps=100,
               train_distribute=None,
               device_fn=None,
               protocol=None,
               eval_distribute=None,
               experimental_distribute=None):
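Instead of editing the defaults in run_config.py, the same knobs can also be passed when constructing the RunConfig for the Estimator, e.g. (values and model_dir are illustrative):

run_config = tf.estimator.RunConfig(
    model_dir='/tmp/train_dir',
    keep_checkpoint_max=50,
    keep_checkpoint_every_n_hours=2)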
You may be interested in this TF GitHub thread that tackles the newest/best checkpoint issue. A user developed his own wrapper, checkmate, around tf.train.Saver to keep track of the best checkpoints.
You can follow up on this PR. With it, your best checkpoint is saved within your checkpoint directory, in a sub-directory named best.
You just need to integrate best_saver() (and its method call in _run_checkpoint_once()) inside ../object_detection/eval_util.py.
Additionally, it will also create a JSON file for all the evaluation metrics.
For saving more checkpoints, you can write a simple Python script that periodically copies the checkpoints to a separate archive directory.
import os
import shutil
import time

while True:
    training_file = '/home/vignesh/training'  # path of your train directory
    archive_file = '/home/vignesh/training/archive'  # path of the directory where you want to save your checkpoints
    files_to_save = []
    for files in os.listdir(training_file):
        if files.rsplit('.')[0] == 'model':
            files_to_save.append(files)
    for files in files_to_save:
        if files in os.listdir(archive_file):
            pass
        else:
            shutil.copy2(training_file + '/' + files, archive_file)
    time.sleep(600)  # This makes the script run every 600 seconds; modify it for your needs

Tensorflow Lite export looks like it does not add weights and adds unsupported operations

I want to reload some of my model variables from the weights saved in the checkpoint and then export the model to a .tflite file.
The question is a bit tricky without seeing the code, so I made this Colab Jupyter notebook with the complete code to explain it better (all the code is working; you can copy it into a new Colab and change it if you want):
https://colab.research.google.com/drive/1wSor4CxEz36LgElVi4y_N8uiSt4-j9b2#scrollTo=XKBQzoW_wd4A
I got it working, but with two issues:
The exported .tflite file is only about 3 KB, so I do not believe it is the entire model with the weights in it. The input alone is a 128x128x3 image; one weight per input value would already be more than 3K.
When I finally import the model in Android, I get this error: "Didn't find custom op for name 'VariableV2' \n Didn't find custom op for name 'ReorderAxes' \n Registration failed."
Maybe the last error is caused by the save/restore operations? They look like they are there when I save the graph definition.
Thanks in advance.
I realized my problem: I was trying to convert a model to TFLite without previously freezing it, and TFLite does not allow "VariableV2" nodes because they should not be there.
The whole problem is fixed by freezing the model like this:
output_graph_def = graph_util.convert_variables_to_constants(sess, sess.graph.as_graph_def(), ["output"])
I lost some time looking for that, hope it helps.
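For completeness, a self-contained sketch of the freeze-then-convert flow using the TF1 APIs (the toy graph and the tensor names "input"/"output" are placeholders, not the model from the notebook):

import tensorflow.compat.v1 as tf
from tensorflow.python.framework import graph_util

tf.disable_eager_execution()

# Toy graph standing in for the real model.
x = tf.placeholder(tf.float32, [1, 128, 128, 3], name="input")
w = tf.Variable(tf.random_normal([128 * 128 * 3, 10]))
y = tf.identity(tf.matmul(tf.reshape(x, [1, -1]), w), name="output")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Fold the variables into constants so no VariableV2 ops remain.
    frozen = graph_util.convert_variables_to_constants(
        sess, sess.graph.as_graph_def(), ["output"])

# Write the frozen graph, then convert it to TFLite.
with tf.gfile.GFile("frozen.pb", "wb") as f:
    f.write(frozen.SerializeToString())

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    "frozen.pb", input_arrays=["input"], output_arrays=["output"])
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)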