mAP/loss chart not appearing when training on Darknet? - yolo

I am currently training with Darknet YOLO, using AlexeyAB's fork for Linux on the master branch (https://github.com/AlexeyAB/darknet), and the mAP/loss chart does not appear in a second, separate window.
My Makefile is unmodified except that I set GPU=1.
I configured my .cfg, .data, .names and weights files, and I can successfully train for 2000 iterations. However, the chart does not appear when I begin training, so I have to take screenshots of my terminal every now and then to make sure training is going well.
Here is the command I use to train:
$ ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74
I have tried adding the -map flag to the end, like so:
$ ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -map
but it still does not appear. Am I missing something in my command or in my config?
Thank you in advance!

In YOLOv4 you can view the training chart by adding the argument -mjpeg_port 8040 (pick any free port). You can then view the chart in your default browser at http://<ip-address>:8040.

The chart should appear after the first 1000 iterations. However, this may not be the case if you set max_batches > 10000, because the weights are then no longer saved at 1000-iteration intervals.
Please double-check your max_batches setting and also try waiting for 1000 iterations.
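The two flags can also be combined; a minimal sketch, assuming the AlexeyAB binary and the training files from the question. (Note that rendering the chart, whether as a window or over MJPEG, requires Darknet to be compiled with OpenCV=1 in the Makefile; since only GPU=1 was changed, that is worth double-checking.)
$ ./darknet detector train data/obj.data yolo-obj.cfg darknet53.conv.74 -map -mjpeg_port 8040 -dont_show
The -dont_show flag is commonly added on headless machines so that Darknet does not try to open a window.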

Related

Trained YOLO, which iteration does yolo_best.weights have?

I trained YOLOv3 via the Darknet framework. Every 1000 iterations it saves the weights, and at the end Darknet evaluates all the saved weights and uses the best, which is stored in a separate file, "yolov3_best.weights".
I want to find out which iteration was used for this file. So far I have tried:
using the weights in a recognition test via the terminal and checking the output
opening the best-weights file in an editor and searching for it
but I couldn't find it.
Does anyone have a solution?
Thanks in advance.
Since it is not clear how to find out which iteration step/epoch was used for the best-weights file, I did the following:
I wrote a script that uses the "yolov3_best.weights" file to detect all classes in the test set, compared the detections with my label data, and calculated the metrics recall, precision and F1.
I did this for all the other weights that Darknet saves by default (every 1000 iteration steps) and compared the results.
In the end I found that "yolov3_best.weights" is not the best for my metrics, so I chose the one with the highest recall value (others may choose according to whichever metric has to be optimized for their case).
Hope this helps others.
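For reference, a minimal sketch of the metric computation such a script might perform, assuming detections have already been matched against ground-truth labels and counted into true positives (tp), false positives (fp) and false negatives (fn) per weights file; the matching step (e.g. by IoU) is omitted and the example counts are hypothetical:
def metrics(tp, fp, fn):
    # Standard detection metrics from raw counts
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts collected for each saved weights file
results = {"yolov3_10000.weights": (80, 10, 20), "yolov3_best.weights": (75, 5, 25)}
for name, (tp, fp, fn) in results.items():
    print(name, metrics(tp, fp, fn))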
From the AlexeyAB wiki there are two ways of doing this:
Test the various weights in backup/ using the map flag.
Example:
./darknet detector map data/obj.data yolo-obj.cfg backup/yolo-obj_7000.weights
Do this for all the weights and pick the one with the highest mAP (mean average precision) or IoU (intersection over union).
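To avoid running that by hand for every checkpoint, a small shell loop can sweep the whole backup/ directory; a sketch, assuming the default backup/ location and the darknet binary in the current directory:
for w in backup/yolo-obj_*.weights; do
    echo "== $w =="
    ./darknet detector map data/obj.data yolo-obj.cfg "$w"
done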
Train with the -map flag.
Example:
./darknet detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map
mAP will be calculated every 4 epochs using the valid=valid.txt file that is specified in data/obj.data.
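For reference, the valid entry mentioned above lives in the .data file; a typical obj.data looks roughly like this (paths and class count are placeholders):
classes = 1
train = data/train.txt
valid = data/valid.txt
names = data/obj.names
backup = backup/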

Can I continue training from the final .weights file with more train and test images?

I trained my custom object detector with Darknet YOLOv3 until the average loss decreased to 0.06, but now I want to train it with more training and test images (and maybe also delete some of the image files). Can I do this and continue training from the final .weights file, or should I start from the beginning?
Yes, you can use the currently trained model (.weights file) as the pre-trained model for a new training session. For example, if you use the AlexeyAB repository you can train your model with a command like this:
darknet.exe detector train data/obj.data yolo-obj.cfg darknet53.conv.74
where darknet53.conv.74 is the pre-trained model.
In the new training session you can add or remove images. However, the basic configuration must remain correct (the number of classes, etc.).
According to the page I mentioned, in the original repository the weights file is saved only once every 10,000 iterations.
If you have only modified the data set and are not interested in changing the model architecture, you can directly resume from the previously saved model using AlexeyAB/darknet. For example:
darknet.exe detector train cfg/obj.data cfg/yolov3.cfg yolov3_weights_last.weights -clear -map
The -clear flag resets the iteration counter saved in the weights, which is appropriate when the data set changes: the learning-rate schedule depends on the iteration count, and you probably don't want to change the configuration.
You need to specify more epochs if you resume. For example, if you trained to 300/300, then resuming will also train to 300 (starting at 300) unless you specify more epochs:
python train.py --resume
You can resume your training from the previously saved weights of your custom model.
Use "yolov3_custom_last.weights" instead of the pre-trained default weights.
In case you find issues with resuming, try changing the batch size.
This should work and resume your model training with the new set of images :)
Open the .cfg, find max_batches (it may be around line 22), and set a bigger value:
max_batches = 500200
max_batches is the total number of training iterations.
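Putting this together, a minimal sketch of resuming, assuming the AlexeyAB layout and the default backup/ directory: raise max_batches in the .cfg, then restart training from the last checkpoint that Darknet wrote:
./darknet detector train data/obj.data yolo-obj.cfg backup/yolo-obj_last.weights -map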
How to continue training after 50000 iterations? #2633

I trained a neural network but I cannot find where it got saved; I cannot find any .meta, .index, or .data files

I was trying to follow this page https://www.tensorflow.org/tutorials/sequences/audio_recognition
I successfully executed the following command:
python tensorflow/examples/speech_commands/train.py
I used a virtual environment in Anaconda, with TensorFlow 1.4 and Python 3.6.
It took about 22 hours to train. It printed "/tmp/speech_commands_train/conv.ckpt-100" (and so on) after every 100 iterations
(there were 18000 in total),
but now when I try to find conv.ckpt-18000.meta, or just speech_commands_train, I cannot find it.
I am very new to this. This is my first effort in deep learning.
[Screenshot: how the terminal looked when training ended]
First, what do you mean by "where it saved": by "it", do you mean the logs, the trained model, or the weights?
In your case you are just storing the weights at the given checkpoints, so you can access them at the paths shown in the tutorial:
I0730 16:54:41.813438 55030 train.py:252] Saving to "/tmp/speech_commands_train/conv.ckpt-100"
This is saving out the current trained weights to a checkpoint file. If your training script gets interrupted, you can look for the last saved checkpoint and then restart the script with -
You can also store logs using a file writer, and the model using save_model or a TensorBoard callback with a logdir.
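To resume from the last checkpoint, the tutorial's train.py accepts a --start_checkpoint flag; a sketch, using the checkpoint path reported in your log (note that /tmp is usually cleared on reboot, so the checkpoints may be gone if the machine restarted since training):
python tensorflow/examples/speech_commands/train.py --start_checkpoint=/tmp/speech_commands_train/conv.ckpt-18000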

Using ssd_inception_v2 to train on different resolution

The dataset contains images of different sizes.
The pretrained weights are trained on 300x300 resolution.
I am training on the WIDER FACE dataset, where objects are as small as 15x15.
Q1. I want to train at 800x800 resolution. Do I need to resize all the images manually, or will this be done by TensorFlow automatically?
I am using the following command to train:
python3 /opt/github/models/research/object_detection/legacy/train.py --logtostderr --train_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q2. I also tried training with model_main.py, but after 1000 iterations it evaluates the dataset at every iteration.
I am using the following command to train:
python3 /opt/github/models/research/object_detection/model_main.py --num_train_steps=200000 --logtostderr --model_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q3. Also, if you can suggest any model I should use for real-time face detection apart from MobileNet and Inception, please do.
Thanks.
Q1. No, you do not need to resize manually. See this detailed answer.
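For reference, the input resolution is controlled by the image_resizer block in pipeline.config; to train at 800x800 you would change it to something like the following sketch (the exact surrounding fields depend on your config file):
image_resizer {
  fixed_shape_resizer {
    height: 800
    width: 800
  }
}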
Q2. By 1000 iterations you mean steps, right? (A step processes one batch; an epoch is a complete cycle over the dataset.) Usually the model performs evaluation after a certain amount of time, e.g. every 10 minutes. So every 10 minutes the checkpoints are saved and an evaluation of the model on the evaluation set is performed.
Q3. SSD models with MobileNet are among the fastest detectors; apart from that, you can try YOLO models for real-time detection.

Trained TensorFlow with my own images successfully, but still have problems

I am using Ubuntu 16.04 with a GeForce 1080 GPU (8 GB of GPU memory).
I have properly created the TF-record files and trained the model successfully. However, I still have two problems.
I did the following steps; please tell me what I am missing:
I used VOCdevkit and properly created two files: pascal_train.record and pascal_val.record.
Then,
1- From this link, I used the raccoon images and placed them in models/object_detection/VOCdevkit/VOC2012/JPEGImages (after deleting the previous images).
Then I used the raccoon annotations and placed them in models/object_detection/VOCdevkit/VOC2012/Annotations (after deleting the previous ones).
2- I modified models/object_detection/data/pascal_label_map.pbtxt and wrote one class name, 'raccoon'.
3- I used ssd_mobilenet_v1_pets.config and modified it: the number of classes is only one, and I did not train from scratch; I used ssd_mobilenet_v1_coco_11_06_2017/model.ckpt:
fine_tune_checkpoint: "/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
4- From this link I arranged my data structure, which looks like this:
models
  1.1 model
    1.1.1 ssd_mobilenet_v1_pets.config
    1.1.2 train
    1.1.3 evaluation
    1.1.4 ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
  1.2 object_detection
    1.2.1 data (contains pascal_train.record, pascal_val.record, and pascal_label_map.pbtxt)
    1.2.2 VOCdevkit
      1.2.2.1 VOC2012
        1.2.2.1.1 JPEGImages (my own images)
        1.2.2.1.2 Annotations (raccoon annotations)
        1.2.2.1.3 ImageSets
          1.2.2.1.3.1 Main (raccoon_train.txt, raccoon_val.txt, raccoon_train_val.txt)
5- Now I train my model:
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config --train_dir=/home/jesse/abdu-py2/models/model/train
Everything looks fine; it created many files, like checkpoint and events.out.tfevents.1503337171 (and others), after many thousands of training steps.
However, my two problems are:
1- Based on this link, I cannot run the evaluation (eval.py) at the same time as train.py, for memory reasons.
2- I tried to use the events.out.tfevents.1503337171 file that was created during the training steps, but it seems it was not created correctly.
So I don't know where I went wrong; I think my data structure may not be correct, as I arranged it based on my own understanding.
Thanks in advance
Edit:
Regarding Q2:
I figured out how to convert the events files and model.ckpt files (created during the training process) to an inference_graph_.pb, which can later be tested with object_detection_tutorial.ipynb. In my case I tried it, but I could not detect anything, since I made a mistake somewhere during the train.py process.
The following command converts the trained files to a .pb file:
(abdu-py2) jesse#jesse-System-Product-Name:~/abdu-py2/models$ python object_detection/export_inference_graph.py \
--input_type image_tensor \
--pipeline_config_path /home/jesse/abdu-py2/models/model/ssd_mobilenet_v1_pets.config \
--trained_checkpoint_prefix /home/jesse/abdu-py2/models/model/train/model.ckpt-27688 \
--output_directory /home/jesse/abdu-py2/models/model
Question 1 - this is just a limitation you'll encounter because of your hardware. Once you get to a point where you'd like to evaluate the model, just stop your training and run your eval command (it seems you've successfully evaluated your model before, so you know the command). It will provide some metrics for the most recent model checkpoint. You can iterate through this process until you're comfortable with the performance of your model.
Question 2 - these event files are used as input to TensorBoard. The event files are in binary format and thus not human-readable. Start a TensorBoard instance while your model is training and/or evaluating. To do so, run something like this:
tensorboard --logdir=train:/home/grasp001/abdu-py2/models/object_detection/train1/train,eval:/home/grasp001/abdu-py2/models/object_detection/train1/eval
Once you have TensorBoard running, use your web browser to navigate to localhost:6006 to check out your metrics. You can also use it during training to monitor loss and other metrics at each training step.
In trainer.py, around line 370, after the session_config is created, limit the GPU memory the process may use:
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
Then you can run eval.py at the same time. By default, TensorFlow grabs all the free GPU memory, whether or not it needs it.
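For context, a minimal sketch of what that line does, assuming the TF 1.x API (the legacy trainer builds a tf.ConfigProto called session_config; the fraction caps how much GPU memory this process may claim):
import tensorflow as tf  # TF 1.x API

session_config = tf.ConfigProto()
# Let this process claim at most half of the GPU memory,
# leaving room for eval.py to run alongside train.py.
session_config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=session_config)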