MXnet fine-tune save model - mxnet

I'm using mxnet's fine-tune example to fine-tune my own data with this code:
https://github.com/dmlc/mxnet/blob/master/example/image-classification/fine-tune.py
By viewing common/fit.py, I got no idea of how to save temp model when I fine tuning.
For example, I wanna save .params files every 5000 iters, how can I do it?
THX!

http://mxnet.io/api/python/callback.html
Try to use the mx.callback API.
module.fit(iterator, num_epoch=n_epoch,
... epoch_end_callback = mx.callback.do_checkpoint("mymodel", 1))
Start training with [cpu(0)]
Epoch[0] Resetting Data Iterator
Epoch[0] Time cost=0.100
Saved checkpoint to "mymodel-0001.params"
Epoch[1] Resetting Data Iterator
Epoch[1] Time cost=0.060
Saved checkpoint to "mymodel-0002.params"

Related

Saving and Loading CNN model

I have a CNN model and I want to save it and load it for prediction in different tab. But I am confused whether the model.evulotion part is included in the part I will save. And I don't know if it would be better to use Model.checkpoint or model.save to save and load. Is there anyone have an idea ? Thank you in advance
I'm in dilemma about using both of them so I've use it.
Using model.eval() just tells the PyTorch model to use mean values for batch normalisation, and deactivates the dropout layers. You can save your model without using model.eval() as it will not affect the performance.
While saving model, saving model's state dictionary is preferred. This can be done as shown:
#declare class of model here
model = NeuralNetwork()
#add training code below
...
#saving model, the model will be saved at the intermediateWeightPath location
intermediateWeightPath = "./bestmodel.pth"
torch.save(model.state_dict(), intermediateWeightPath)

INFO:tensorflow:Waiting for new checkpoint at models/faster_rcnn

I used the transfer learning approach to develop a detection model using the faster_rcnn algorithm.
To evaluate my model, I used the following commands-
!python model_main_tf2.py --model_dir=models/faster_rcnn_inception_resnet_v2 --pipeline_config_path=models/faster_rcnn_inception_resnet_v2/pipeline.config --checkpoint_dir=models/faster_rcnn_inception_resnet_v2
However, I have been getting the following error/info message: -
INFO:tensorflow:Waiting for new checkpoint at models/faster_rcnn_inception_resnet_v2
I0331 23:23:11.699681 140426971481984 checkpoint_utils.py:139] Waiting for new checkpoint at models/faster_rcnn_inception_resnet_v2
I checked the path to the checkpoint_dir is correct. What could be the problem and how can I resolve it?
Thanks in advance.
You need to run another script for training to generate new checkpoint. model_main_tf2.py does not do both at once, i.e., it won't train model and evaluate the model at the end of each epoch.
One way to get what you want modifying checkpoint_max_to_keep in https://github.com/tensorflow/models/blob/13ec3c1460b928301d208115aed0c94fb47538b7/research/object_detection/model_lib_v2.py#L445
to keep all checkpoints, then evaluate separately. This does not work exactly same as you want, but it generates the curves.
A similar situation happened to me. I don't know if this is the solution or just a workaround but it did work for me. I simply exported my model and provided the path to that checkpoint folder.
fintune_checkpoint_model_directory
|
\---checkpoint(folder)
|
\---checkpoint(file with no extension)
\---ckpt-1.data0000of0001
\---ckpt-1.index
and then simply run the model_main_tf.py file for evaluation.
if you trained your model with a few number of steps it can be a problem maybe a few checkpoints can affect that TensorFlow can't generate evaluation so try to increase the number of steps

Cannot load model weights in TensorFlow 2

I cannot load model weights after saving them in TensorFlow 2.2. Weights appear to be saved correctly (I think), however, I fail to load the pre-trained model.
My current code is:
segmentor = sequential_model_1()
discriminator = sequential_model_2()
def save_model(ckp_dir):
# create directory, if it does not exist:
utils.safe_mkdir(ckp_dir)
# save weights
segmentor.save_weights(os.path.join(ckp_dir, 'checkpoint-segmentor'))
discriminator.save_weights(os.path.join(ckp_dir, 'checkpoint-discriminator'))
def load_pretrained_model(ckp_dir):
try:
segmentor.load_weights(os.path.join(ckp_dir, 'checkpoint-segmentor'), skip_mismatch=True)
discriminator.load_weights(os.path.join(ckp_dir, 'checkpoint-discriminator'), skip_mismatch=True)
print('Loading pre-trained model from: {0}'.format(ckp_dir))
except ValueError:
print('No pre-trained model available.')
Then I have the training loop:
# training loop:
for epoch in range(num_epochs):
for image, label in dataset:
train_step()
# save best model I find during training:
if this_is_the_best_model_on_validation_set():
save_model(ckp_dir='logs_dir')
And then, at the end of the training "for loop", I want to load the best model and do a test with it. Hence, I run:
# load saved model and do a test:
load_pretrained_model(ckp_dir='logs_dir')
test()
However, this results in a ValueError. I checked the directory where the weights should be saved, and there they are!
Any idea what is wrong with my code? Am I loading the weights incorrectly?
Thank you!
Ok here is your problem - the try-except block you have is obscuring the real issue. Removing it gives the ValueError:
ValueError: When calling model.load_weights, skip_mismatch can only be set to True when by_name is True.
There are two ways to mitigate this - you can either call load_weights with by_name=True, or remove skip_mismatch=True depending on your needs. Either case works for me when testing your code.
Another consideration is that you when you store both the discriminator and segmentor checkpoints to the log directory, you overwrite the checkpoint file each time. This contains two strings that give the path to the specific model checkpoint files. Since you save discriminator second, every time this file will say discriminator with no reference to segmentor. You can mitigate this by storing each model in two subdirectories in the log directory instead, i.e.
logs_dir/
+ discriminator/
+ checkpoint
+ ...
+ segmentor/
+ checkpoint
+ ...
Although in the current state your code would work in this case.

Can I continue to training from final .weight with more train and test images?

I trained my custom object detection with darknet yolov3 untill the average loss decreased down to 0.06 but now i want to train it with more training and test images (maybe also deleting some of the image files). Can I do these steps and continue to training with final .weights file or I should start it from the beginning?
Yes, you can use the currently trained model (.weights file) as the pre-trained model for the new training session. For example, if you use AlexeyAB repository you can train your model by a command like this:
darknet.exe detector train data/obj.data yolo-obj.cfg darknet53.conv.74
where darknet53.conv.74 is the pre-trained model.
In the new training session, you can add or remove images. However, the basic configurations should be correct (like the number of classes, etc).
According to the page I mentioned:
in the original repository original repository the
weights-file is saved only once every 10 000 iterations
If you have just modified the data set, but are not interested in changing the model architecture,you can directly resume from the previously saved model using DarkNet in AlexeyAB/darknet. For example,
darknet.exe detector train cfg/obj.data cfg/yolov3.cfg yolov3_weights_last.weights -clear -map
The clear flag will reset iterations saved in the weights, which is appropriate in case of data set changes. That is because the learning rate often depends on the iterations, and you probably don't want to change the configurations.
You need to specify more epochs if you resume. For example if you train to 300/300 then resume will also train to 300 also (starting at 300) unless you specify more epochs..
python train.py --resume
you can resume your training from the previously saved weights, of your custom model.
use the "yolov3_custom_last.weights" instead of the pre-trained default weights.
Incase you find some issues with resuming, try changing the batch size .
this should work and resume your model training with new set of images :)
open the .cfg, find the max_batches code may be in 22 row, set the bigger value:
max_batches = 500200
max_batches is the same to the tranning iteration.
How to continute training after 50000 iteration? #2633

How to use model.ckpt and inception v-3 to predict images?

Now I'm in front of the problem about inception v-3 and checkpoint data.
I have been tackling with updating inception-v3's checkpoint data by my images, reading the git page below and succeeded to make new checkpoint data.
https://github.com/tensorflow/models/tree/master/inception
I thought at first just by little change of the code, I can use those checkpoint data to recognise new image datas like the below url.
https://www.tensorflow.org/versions/master/tutorials/image_recognition/index.html
I thought at first that "classify.py" or something reads the new check point datas and just by "python classify.py -image something.png", the program recognises the image data. But It doesn't....
I really need a help.
thanks.
To have input .pb file, during training, import also tf.train.write_graph(sess.graph.as_graph_def(), 'path_to_folder', 'input_graph.pb',False)
If you have downloaded the inception v3 source code, in the inception_train.py, add the line I wrote above, under
saver.save(sess, checkpoint_path, global_step=step). (Where you save the checkpoint/s)
Hope this helps!
To use your checkpoints and model in something like the label_image example, you'll need to run the tensorflow/python/tools/freeze_graph script to convert your variables into constants stored inside the GraphDef. That's how we created the graph file used in that sample code, for example.