Tensorflow-loss not decreasing when training - tensorflow

I am using the TensorFlow Object Detection API on my own dataset and I am facing some problems. I am on CentOS with a GeForce 1080 GPU (8 GB GPU memory) and TensorFlow 1.2.1. I have 500 images in the training set and 40 in the test set. I did the following steps and I have two problems.
1. I annotated my images using the LabelImg tool.
2. Created the tfrecords successfully.
3. I used ssd_inception_v2_coco.config. I modified only the paths and the number of classes, and I did not train from scratch; I used the ssd_inception_v2_coco model checkpoints.
Problem 1: From step 0 until 3000 my loss decreased dramatically, but after that it stays roughly constant between 5 and 6. I am not sure how to reduce it further, although my model is still able to detect the required object. Here are my TensorBoard samples.
I also tried a different model, e.g. faster_rcnn_inception_resnet_v2_atrous_coco; after some steps the loss stays constant between 1 and 2.
Problem 2: Following the documentation I am able to run eval.py, but I get the following error:
WARNING:root:The following classes have no ground truth examples: 0
after which the program terminates.
I tried running train.py and eval.py at the same time and still get the same error.
Please give me a suggestion. I am a TensorFlow beginner and need guidance.

The loss curve you're seeing on TensorBoard is quite normal. Initially the loss will drop very quickly, but it will seemingly "bottom out" over time. Training is a slow process; you should see a steady drop over time after more iterations.

Related

YOLOv4 loss too high

I am using YOLOv4-tiny on a custom dataset of 26 classes that I collected from the Open Images Dataset. The dataset is almost balanced (850 images per class, but a different number of bounding boxes). When I used YOLOv4-tiny to train on just 3 classes the loss was near 0.5 and the model was fairly accurate. But with 26 classes, as soon as the loss goes below 2 the model starts to overfit. The predictions are also very inaccurate.
I have tried changing parameters like the learning rate, the momentum and the size, but whatever I do the model becomes worse than before. Using the regular YOLOv4 model rather than YOLOv4-tiny does not help either. How can I bring the loss further down?
Have you tried training with mAP? You can take a subset of your training set and make it the validation set, in the same way you made your training and test sets. Then you can run darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map. This keeps track of the loss on your validation set. When the error on the validation set goes up, that is the time to stop training and prevent overfitting (this is called early stopping).
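For the -map flag to be useful, the obj.data file should point at that validation list. A minimal sketch of such a file, where the paths are assumptions you should adjust to your own layout:

classes = 26
train = data/train.txt
valid = data/valid.txt
names = data/obj.names
backup = backup/

Here valid points at a plain text file listing the validation image paths, built the same way as train.txt; with -map, darknet then periodically reports mAP on that set.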
You need to run the training for (classes * 2000) iterations. However, for the best scores you need to train your model for at least 6000 iterations (this value is known as max_batches). Also, please remember that if you are using black-and-white images you should change channels=3 to channels=1. You can stop your training once the average loss settles at something like 0.XXXX.
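As a rough sketch of where those settings live in the [net] section of yolo-obj.cfg (the numbers assume the 26-class case above and are examples, not a drop-in file):

[net]
# width and height must stay multiples of 32
width=416
height=416
# use channels=1 instead of 3 for black-and-white images
channels=3
# max_batches = classes*2000 (26*2000 here), and never below 6000
max_batches = 52000
# steps are conventionally set to 80% and 90% of max_batches
steps = 41600,46800
# ... the remaining [net] settings stay as shipped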
Here's my mAP graph for 6000 iterations, which ran for 6.2 hours (image: avg loss with 6000 max_batches).
Moreover, you can follow this FAQ documentation here by Stéphane Charette.

Tensorflow mIOU and pixel accuracy bug?

Let's say I started training a TensorFlow model from scratch with 1000 training steps. I get the following result at the completion of training.
Now, let's say I want to train for 2000 training steps from the previously saved checkpoint. This time I get mIOU and pixel_accuracy = 1.0.
I am using TensorFlow v1.13.1. How can I fix this bug or problem?
The problem was with my dataset. I was assigning background_tag=1, class_one=2, class_two=3. After modifying the ground-truth images with Python PIL, everything worked normally.
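Presumably the fix amounts to shifting the labels down so the background is 0 and the classes start at 1. A minimal sketch of that remap with PIL and NumPy, assuming single-channel PNG masks and placeholder file names:

import numpy as np
from PIL import Image

# Load a ground-truth mask whose pixels are currently 1 (background), 2 and 3 (classes).
mask = np.array(Image.open("ground_truth.png"))

# Shift every label down by one: background -> 0, class_one -> 1, class_two -> 2.
remapped = (mask - 1).astype(np.uint8)

Image.fromarray(remapped).save("ground_truth_fixed.png")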

Using ssd_inception_v2 to train on different resolution

The dataset contains images of different sizes.
The pretrained weights are trained on 300x300 resolution.
I am training on the widerface dataset, where objects are as small as 15x15.
Q1. I want to train at 800x800 resolution. Do I need to resize all the images manually, or will this be done by TensorFlow automatically?
I am using the following command to train:
python3 /opt/github/models/research/object_detection/legacy/train.py --logtostderr --train_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q2. I also tried training with model_main.py, but after 1000 iterations it starts evaluating the dataset at each iteration.
I am using the following command to train:
python3 /opt/github/models/research/object_detection/model_main.py --num_train_steps=200000 --logtostderr --model_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q3. Also, if you can suggest any model I should use for real-time face detection apart from MobileNet and Inception, please do.
Thanks.
Q1. No, you do not need to resize manually. See this detailed answer.
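For reference, the input resolution is controlled by the image_resizer block of pipeline.config, and the API resizes every training and evaluation image on the fly. A sketch of an 800x800 fixed-shape resizer (only this block changes; the rest of the model section stays as shipped):

model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 800
        width: 800
      }
    }
    ...
  }
}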
Q2. By 1000 iterations you mean steps, right? (A training step processes one batch; a complete cycle through the dataset is an epoch.) Usually the model performs evaluation after a certain amount of time, e.g. every 10 minutes: the checkpoints are saved and an evaluation of the model on the evaluation set is performed.
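As a hedged illustration of where that interval comes from: in the TF1 Estimator-based model_main.py, evaluation is tied to checkpoint saving, and the checkpoint interval comes from the tf.estimator.RunConfig built inside the script (roughly every 600 seconds by default). A sketch of the kind of change that stretches the interval, assuming you edit that RunConfig:

import tensorflow as tf

# model_main.py builds a RunConfig similar to this; raising save_checkpoints_secs
# makes checkpoints, and therefore evaluations, less frequent.
config = tf.estimator.RunConfig(
    model_dir='/opt/github/object_detection_retraining/wider_face_checkpoint/',
    save_checkpoints_secs=1800,  # e.g. every 30 minutes instead of the ~10-minute default
)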
Q3. SSD models with MobileNet are among the fastest detectors; apart from that, you can try YOLO models for real-time detection.

Avoiding overfitting while training a neural network with Tensorflow

I am training a neural network using TensorFlow's Object Detection API to detect cars. I used the following YouTube video series to learn and execute the process.
https://www.youtube.com/watch?v=srPndLNMMpk&t=65s
Parts 1 to 6 of his series.
Now, in his video he mentions stopping the training when the loss value reaches ~1 or below on average, and says that it would take about 10000-ish steps.
In my case it is at 7500 steps right now and the loss values keep fluctuating between 0.6 and 1.3.
A lot of people complained in the comments section about false positives with this series, but I think this happened because of unnecessarily prolonged training (because they perhaps didn't know when to stop?), which caused overfitting!
I would like to avoid this problem. I would like to end up not with the most optimal weights but with fairly good weights, while avoiding false detections and overfitting. I am also observing the 'Total Loss' section of TensorBoard; it fluctuates between 0.8 and 1.2. When do I stop the training process?
I would also like to know, in general, which factors the decision to stop training depends on. Is it always about an average loss of 1 or less?
Additional information:
My training data has ~300 images
Test data ~ 20 images
Since I am using transfer learning, I chose the ssd_mobilenet_v1 model.
TensorFlow version 1.9 (on CPU)
Python version 3.6
Thank you!
You should use a validation set, different from the training set and the test set.
At each epoch, compute the loss on both the training and validation sets.
If the validation loss begins to increase, stop your training. You can then test your model on your test set.
The validation set is usually the same size as the test set; for example, the training set is 70% and the validation and test sets are 15% each.
Also, please note that 300 images does not seem like enough for your dataset. You should increase it.
For your other question:
The loss is the sum of your errors and thus depends on the problem and your data. A loss of 1 does not mean much in this regard; never rely on it alone to decide when to stop your training.
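A minimal, self-contained sketch of that early-stopping idea with tf.keras (this is not the Object Detection API training loop; the toy data and model below exist only to show the callback wiring):

import numpy as np
import tensorflow as tf

# Toy data standing in for a real dataset.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop once the validation loss has not improved for 5 consecutive epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5)

# Hold out 15% of the data as the validation set, as suggested above.
model.fit(x, y, epochs=100, validation_split=0.15, callbacks=[early_stop])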

OOM error when training TensorFlow object-detection API using inception-resnet & NASnet (especially) as backbone

Please help me find a solution to my problems. It is important to state first that I have successfully created my own custom dataset and successfully trained it using resnet101 on my own computer (16 GB RAM and a 4 GB NVIDIA 980).
The problem arises when I try to switch the backbone to inception-resnet or nasnet. I get the following error:
"ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape ..."
I thought I didn't have enough resources on my computer, so I created an instance on AWS EC2 with 60 GB RAM and a 12 GB NVIDIA Tesla K80 (my workplace only provides this service) and trained the network there.
The training for inception-resnet worked well; however, that is not the case with nasnet. Even with 100 GB of memory I still get an OOM error.
I found one solution on the tensorflow/models GitHub page, in issue #1817, and I followed the instructions by adding the following lines to the nasnet config file:
train_config: {
  batch_size: 1
  batch_queue_capacity: 50
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
  ...
The code then ran well for a while (I was watching memory usage with "top"). However, I still got the OOM error after running around 6000 steps:
INFO:tensorflow:global step 6348: loss = 2.0393 (3.988 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/ubuntu/crack-detection/structure-crack/models/faster_rcnn_nas_coco_2017_11_08/train/model.ckpt
INFO:tensorflow:global step 6349: loss = 0.9803 (3.980 sec/step)
2018-01-25 05:51:25.959402: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 79.73MiB. Current allocation summary follows.
...
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[64,17,17,4032]
[[Node: MaxPool2D/MaxPool = MaxPool[T=DT_FLOAT, data_format="NHWC", ksize=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1],
...
Is there anything else I can do to run this smoothly without any OOM errors? Thanks for your help.
EDIT #1: The errors come more frequently now; they show up after 1000-1500 steps.
EDIT #2: Based on issue #2668 and issue #3014, there is one more thing we can do to run the code without OOM errors: add second_stage_batch_size: 25 (the default is 50) to the model section of the config file. The file should then look like the following:
model {
  faster_rcnn {
    ...
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
    second_stage_batch_size: 25
  }
}
Hope this can help.
I would like to point out that the memory you are running out of is the GPU's, so I'm afraid those 100 GB of RAM are only useful for data wrangling outside of training. Also, without code it is really difficult to figure out where the error is coming from.
That being said, if you can initialize the neural net architecture with weights and train for 6000 iterations before suddenly running out of GPU memory, then I guess you are either somehow accumulating values in GPU memory or, if you have variable-length inputs, you might be passing a sequence in that iteration which is too big memory-wise.
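If it helps to make the distinction concrete, TF1 manages GPU memory separately from system RAM through the session configuration. A small illustrative sketch of the relevant knobs in a plain TF1 session (the Object Detection API builds its session internally, so this is only conceptual, not a drop-in fix for a model that is genuinely too large):

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # grab GPU memory on demand instead of all at once
# config.gpu_options.per_process_gpu_memory_fraction = 0.9  # or cap the fraction of the card used

with tf.Session(config=config) as sess:
    print(sess.run(tf.constant("session created with custom GPU memory options")))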