YOLO darknet freezes when I start training my model

YOLO darknet freezes when I start training my model - google-colaboratory

I am training a model with yolo darknet in google colab but when I start the training the page freezes and a pop-up window appears that the web page does not respond
I don't know if it is because the model has many classes to train and the page collapses
here is my code:
!apt-get update
!unzip "/content/drive/My Drive/custom_dib_model/darknet.zip"
!sudo apt install dos2unix
!find . -type f -print0 | xargs -0 dos2unix
!chmod +x /content/darknet
!make
!./darknet detector test cfg/coco.data cfg/yolov4.cfg yolov4.weights data/person.jpg
!rm /content/darknet/backup -r
!ln -s /content/drive/'My Drive'/dib_weights/backup /content/darknet
!./darknet detector train dibujos_dataset/dib.data dib_yolov4.cfg yolov4.conv.137 -map -dont_show
The last line is the one that begins the training of my model and about 5 min pass when the page freezes, it should be noted that no error appears
I found a similar question but there is no concrete answer
possible answer by the user
This is all the information that I can give you and I hope it is enough

I solved it by changing the YOLO configuration dib_yolo.cfg file modifying the subdivisions from 64 to 16

Related

Stuck at training model with CPU

As the example points out:
docker run -it -p 8500:8500 --gpus all tensorflow/serving:latest-devel
should train the mnist mode, however I want to use intel cpu for training, not gpu. But no luck, it stucked at Training model...
Here is the command I used:
docker run -it -p 8500:8500 tensorflow/serving:latest-devel

I found out that it will download resources at first, which a proxy is needed sometimes.

freeze model for inference with output_node_name for ssd mobilenet v1 coco

I want to compile the TensorFlow Graph to Movidius Graph. I have used Model Zoo's ssd_mobilenet_v1_coco model to train it on my own dataset.
Then I ran
python object_detection/export_inference_graph.py \
--input_type=image_tensor \
--pipeline_config_path=/home/redtwo/nsir/ssd_mobilenet_v1_coco.config \
--trained_checkpoint_prefix=/home/redtwo/nsir/train/model.ckpt-3362 \
--output_directory=/home/redtwo/nsir/output
which generates me frozen_interference_graph.pb & saved_model/saved_model.pb
Now to convert this saved model into Movidius graph. There are commands given
Export GraphDef file
python3 ../tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=inception_v3.pb \
--input_binary=true \
--input_checkpoint=inception_v3.ckpt \
--output_graph=inception_v3_frozen.pb \
--output_node_name=InceptionV3/Predictions/Reshape_1
Freeze model for inference
python3 ../tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=inception_v3.pb \
--input_binary=true \
--input_checkpoint=inception_v3.ckpt \
--output_graph=inception_v3_frozen.pb \
--output_node_name=InceptionV3/Predictions/Reshape_1
which can finally be feed to NCS Intel Movidius SDK
mvNCCompile -s 12 inception_v3_frozen.pb -in=input -on=InceptionV3/Predictions/Reshape_1
All of this is given at Intel Movidius Website here: https://movidius.github.io/ncsdk/tf_modelzoo.html
My model was already trained i.e. output/frozen_inference_graph. Why do I again freeze it using /slim/export_inference_graph.py or it's the output/saved_model/saved_model.py that will go as input to slim/export_inference_graph.py??
All I want is output_node_name=Inceptionv3/Predictions/Reshape_1. How to get this output_name_name directory structure & anything inside it? I don't know what all it contains
what output node should I use for model zoo's ssd_mobilenet_v1_coco model(trained on my own custom dataset)
python freeze_graph.py \
--input_graph=/path/to/graph.pbtxt \
--input_checkpoint=/path/to/model.ckpt-22480 \
--input_binary=false \
--output_graph=/path/to/frozen_graph.pb \
--output_node_names="the nodes that you want to output e.g. InceptionV3/Predictions/Reshape_1 for Inception V3 "
Things I understand & don't understand:
input_checkpoint: ✓ [check points that were created during training]
output_graph: ✓ [path to output frozen graph]
out_node_names: X
I don't understand out_node_names parameter & what should inside this considering its ssd_mobilnet not inception_v3
System information
What is the top-level directory of the model you are using:
Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
TensorFlow installed from (source or binary): TensorFlow installed with pip
TensorFlow version (use command below): 1.13.1
Bazel version (if compiling from source):
CUDA/cuDNN version: V10.1.168/7.*
GPU model and memory: 2080Ti 11Gb
Exact command to reproduce:

The graph in saved_model/saved_model.pb is the graph definition(graph architecture) of the pretrained inception_v3 model without the weights loaded to the graph. The frozen_interference_graph.pb is the graph frozen with the checkpoints you have provided and taking the default output nodes of the inception_v3 model.
To get output node names summarise_graph tool can be used
You can use the below commands to use summarise_graph tool if bazel is installed
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph \
--in_graph=/tmp/inception_v3_inf_graph.pb
In case if bazel is not installed Output nodes can be obtained using the tensorboard or any other graph visualising tools like Netron.
The additional freeze_graph.py can be used to freeze the graph specifying the output nodes(ie in a case where additional output nodes are added to the inceptionV3). The frozen_interference_graph.pb is also an equaly good fit for infrencing.

Tensorflow serving No versions of servable <MODEL> found under base path

I was following this tutorial to use tensorflow serving using my object detection model. I am using tensorflow object detection for generating the model. I have created a frozen model using this exporter (the generated frozen model works using python script).
The frozen graph directory has following contents ( nothing on variables directory)
variables/
saved_model.pb
Now when I try to serve the model using the following command,
tensorflow_model_server --port=9000 --model_name=ssd --model_base_path=/serving/ssd_frozen/
It always shows me
...
tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding
model: ssd 2017-08-07 10:22:43.892834: W
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:262]
No versions of servable ssd found under base path /serving/ssd_frozen/
2017-08-07 10:22:44.892901: W
tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:262]
No versions of servable ssd found under base path /serving/ssd_frozen/
...

I had same problem, the reason is because object detection api does not assign version of your model when exporting your detection model. However, tensorflow serving requires you to assign a version number of your detection model, so that you could choose different versions of your models to serve. In your case, you should put your detection model(.pb file and variables folder) under folder:
/serving/ssd_frozen/1/. In this way, you will assign your model to version 1, and tensorflow serving will automatically load this version since you only have one version. By default tensorflow serving will automatically serve the latest version(ie, the largest number of versions).
Note, after you created 1/ folder, the model_base_path is still need to be set to --model_base_path=/serving/ssd_frozen/.

For new version of tf serving, as you know, it no longer supports the model format used to be exported by SessionBundle but now SavedModelBuilder.
I suppose it's better to restore a session from your older model format and then export it by SavedModelBuilder. You can indicate the version of your model with it.
def export_saved_model(version, path, sess=None):
tf.app.flags.DEFINE_integer('version', version, 'version number of the model.')
tf.app.flags.DEFINE_string('work_dir', path, 'your older model directory.')
tf.app.flags.DEFINE_string('model_dir', '/tmp/model_name', 'saved model directory')
FLAGS = tf.app.flags.FLAGS
# you can give the session and export your model immediately after training
if not sess:
saver = tf.train.import_meta_graph(os.path.join(path, 'xxx.ckpt.meta'))
saver.restore(sess, tf.train.latest_checkpoint(path))
export_path = os.path.join(
tf.compat.as_bytes(FLAGS.model_dir),
tf.compat.as_bytes(str(FLAGS.version)))
builder = tf.saved_model.builder.SavedModelBuilder(export_path)
# define the signature def map here
# ...
legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
builder.add_meta_graph_and_variables(
sess, [tf.saved_model.tag_constants.SERVING],
signature_def_map={
'predict_xxx':
prediction_signature
},
legacy_init_op=legacy_init_op
)
builder.save()
print('Export SavedModel!')
you could find main part of the code above in tf serving example.
Finally it will generate the SavedModel in a format that can be served.

Create a version folder under like - serving/model_name/0000123/saved_model.pb
Answer's above already explained why it is important to keep a version number inside the model folder. Follow below link , here they have different sets of built models , you can take it as a reference.
https://github.com/tensorflow/serving/tree/master/tensorflow_serving/servables/tensorflow/testdata

I was doing this on my personal computer running Ubuntu, not Docker. Note I am in a directory called "serving". This is where I saved my folder "mobile_weight". I had to create a new folder, "0000123" inside "mobile_weight". My path looks like serving->mobile_weight->0000123->(variables folder and saved_model.pb)
The command from the tensorflow serving tutorial should look like (Change model_name and your directory):
nohup tensorflow_model_server \
--rest_api_port=8501 \
--model_name=model_weight \
--model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1
So my entire terminal screen looks like:
murage#murage-HP-Spectre-x360-Convertible:~/Desktop/serving$ nohup tensorflow_model_server --rest_api_port=8501 --model_name=model_weight --model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1

That error message can also result due to issues with the --volume argument.
Ensure your --volume mount is actually correct and points to the model's dir, as this is a general 'model not found' error, but it just seems more complex.
If on windows just use cmd, otherwise its easy to accidentally use linux file path and linux separators in cygwin or gitbash. Even with the correct file structure you can get OP's error if you don't use the windows absolute path.
#using cygwin
$ echo $TESTDATA
/home/username/directory/serving/tensorflow_serving/servables/tensorflow/testdata
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 20:12:28.995834: W tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:267] No versions of servable half_plus_two found under base path /models/half_plus_two. Did you forget to name your leaf directory as a number (eg. '/1/')?
Then calling the same command with the same unchanged file structure but with the full windows path using windows file separators, and it works:
#using cygwin
$ export TESTDATA="$(cygpath -w "/home/username/directory/serving/tensorflow_serving/servables/tensorflow/testdata")"
$ echo $TESTDATA
C:\Users\username\directory\serving\tensorflow_serving\servables\tensorflow\testdata
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA\\saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 21:10:49.527049: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: half_plus_two version: 1}

Inception v3 flowers_train using the checkpoint model

I have used the flowers_train script found here: flowers_train.py.
To retrain the existing inception v3 model on 10 new classes. The flowers_train script generates some checkpoint files of the format:
checkpoint model.ckpt-1030000.index
events.out.tfevents.1501217995.tron model.ckpt-1030000.meta
model.ckpt-1020000.data-00000-of-00001 model.ckpt-1035000.data-00000-of-00001
model.ckpt-1020000.index model.ckpt-1035000.index
model.ckpt-1020000.meta model.ckpt-1035000.meta
model.ckpt-1025000.data-00000-of-00001 model.ckpt-1040000.data-00000-of-00001
model.ckpt-1025000.index model.ckpt-1040000.index
model.ckpt-1025000.meta model.ckpt-1040000.meta
model.ckpt-1030000.data-00000-of-00001
The classify_image.py script is found here.
It expects a .pb file, not a checkpoint file.
I've been pulling my hair over the past two weeks trying to figure out how to get from the checkpoint file to the .pb file so I can use the retrained model.
Any ideas would be appreciated.

Any ideas would be appreciated.
The code at https://www.tensorflow.org/hub/tutorials/image_retraining
cd ~
curl -LO http://download.tensorflow.org/example_images/flower_photos.tgz
tar xzf flower_photos.tgz
mkdir ~/example_code
cd ~/example_code
curl -LO https://github.com/tensorflow/hub/raw/master/examples/image_retraining/retrain.py
python retrain.py --image_dir ~/flower_photos
produces the file ./output_graph.pb.

How do I get the 'checkpoint' of a retrained inception-v3 model?

I'm trying use a retrained inception-v3 model in tensorflow-serving. But it seems that I have to provide a 'checkpoint'. I was wondering how do I get those 'checkpoints'? The retrain.py returns me a retrained_graph.pb. I followed this tutorial (https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/#0)
Thank you!

You may want to look into the API changes for Tensorflow 1.0
https://www.tensorflow.org/install/migration
to make it work to create a checkpoint file/s instead of .pb file.
Also refer to:
https://www.tensorflow.org/tutorials/image_retraining

Here is an example of how to download the checkpoints for the latest edition of pre-trained Inception V3 model:
$ CHECKPOINT_DIR=/tmp/checkpoints
$ mkdir ${CHECKPOINT_DIR}
$ wget http://download.tensorflow.org/models/inception_v3_2016_08_28.tar.gz
$ tar -xvf inception_v3_2016_08_28.tar.gz
$ mv inception_v3.ckpt ${CHECKPOINT_DIR}
$ rm inception_v3_2016_08_28.tar.gz

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

YOLO darknet freezes when I start training my model - google-colaboratory

I solved it by changing the YOLO configuration dib_yolo.cfg file modifying the subdivisions from 64 to 16

Related

Stuck at training model with CPU

freeze model for inference with output_node_name for ssd mobilenet v1 coco

Tensorflow serving No versions of servable <MODEL> found under base path

Inception v3 flowers_train using the checkpoint model

How do I get the 'checkpoint' of a retrained inception-v3 model?

Categories

Resources