Tensorflow mIOU and pixel accuracy bug? - tensorflow

Let's say I started training a tensorflow model from scratch with 1000 training steps. I get the following result at the completion of training.
Now, let's say I want to train for 2000 training steps from the previously saved checkpoint. I then get mIOU and pixel_accuracy of 1.0.
I am using TensorFlow v1.13.1. How can I fix this bug or problem?

The problem was with my dataset. I was assigning background_tag=1, class_one=2, class_two=3. After modifying the ground-truth images with Python PIL, everything worked normally.
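For reference, that kind of relabelling can be done with a short PIL/NumPy script along these lines (the mapping {1: 0, 2: 1, 3: 2} and the file names are illustrative assumptions, not the original code):
import numpy as np
from PIL import Image

# Hypothetical remapping: shift the labels so the background class becomes 0.
mask = np.array(Image.open("ground_truth.png"))  # assumed single-channel label image
remapped = mask.copy()
for old_label, new_label in {1: 0, 2: 1, 3: 2}.items():
    remapped[mask == old_label] = new_label
Image.fromarray(remapped).save("ground_truth_fixed.png")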

Related

Functional CNN model written in Keras 2.2.4 not learning in TensorFlow or Keras 2.4

I am dealing with an object detection problem and using a model which is known to work (its results have been published in a paper and I have the original code). Originally, the code was written with Keras 2.2.4, without importing TensorFlow, and was trained and tested on the same dataset that I am using at the moment. However, when I try to run the same model with TensorFlow 2.x it just won't learn a thing.
I have tried importing everything from TensorFlow 2.4, but I have the same problem if I import everything (layers, models, optimizers...) from Keras 2.4. I have also tried on two different devices, both using a GPU. What happens is that the loss function decreases ridiculously fast, but the accuracy won't increase a bit (or, if it does, it gets stuck around 10% or so). Also, every now and then this happens from one epoch to the next:
Loss undergoes HUGE jumps between consecutive epochs, and all this without any changes in accuracy
I have tried to train the network on another dataset (had to change the last layers in order to match the required dimensions) and the model seemed to be learning in a normal way, i.e. the accuracy actually increases and the loss doesn't reach 0.0x in one epoch.
I can't post the script, but the model is an encoder-decoder network: consecutive convolutions with an increasing number of filters reduce the dimensions of the image, and a mirrored path of transposed convolutions restores the original dimensions. So basically the network only contains:
1. Conv2D
2. Conv2DTranspose
3. BatchNormalization
4. Activation("relu")
5. Activation("sigmoid")
6. concatenate
concatenate (6) is used to put together outputs from parallel paths or distant layers; BatchNormalization (3) and Activation("relu") (4) are used after every Conv or ConvTranspose; Activation("sigmoid") (5) is only used as the final activation function, i.e. as the output layer.
I think the problem is pretty generic and I am honestly surprised that I couldn't find a single question about it. What could be happening here? The problem must have something to do with the TF/Keras versions, but I can't find any documentation about it, and I have tried changing so many things and nothing helps. It's frustrating because, if I didn't know that the model works, I would try to rewrite it from scratch; as it is, I am afraid the same problem may occur with a new network too, and I won't be able to tell whether the issue is the libraries or the model itself.
Thank you in advance! :)
EDIT
Code snippets:
Convolutional block:
encoder1 = Conv2D(filters=first_layer_channels, kernel_size=2, strides=2)(input)
encoder1 = BatchNormalization()(encoder1)
encoder1 = Activation('relu')(encoder1)
Decoder block:
decoder1 = Conv2DTranspose(filters=first_layer_channels, kernel_size=2, strides=2)(encoder4)
decoder1 = BatchNormalization()(decoder1)
decoder1 = Activation('relu')(decoder1)
Final layers:
final = Conv2D(filters=total, kernel_size=1)(decoder4)
final = BatchNormalization()(final)
Last_Conv = Activation('sigmoid')(final)
The task is human pose estimation: the network (which, I recall, works on this specific task with Keras 2.2.4) has to predict twenty binary maps containing the positions of specific keypoints.
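For context, here is a minimal sketch that wires blocks like the snippets above into a tiny runnable model (the input size, filter counts and two-level depth are assumptions for illustration, not the actual network):
from tensorflow.keras.layers import (Input, Conv2D, Conv2DTranspose,
                                     BatchNormalization, Activation, concatenate)
from tensorflow.keras.models import Model

inputs = Input(shape=(256, 256, 3))  # assumed input resolution

# encoder blocks (as in the snippet above)
enc1 = Conv2D(filters=32, kernel_size=2, strides=2)(inputs)   # 256 -> 128
enc1 = BatchNormalization()(enc1)
enc1 = Activation('relu')(enc1)

enc2 = Conv2D(filters=64, kernel_size=2, strides=2)(enc1)     # 128 -> 64
enc2 = BatchNormalization()(enc2)
enc2 = Activation('relu')(enc2)

# decoder block with a skip connection via concatenate
dec1 = Conv2DTranspose(filters=32, kernel_size=2, strides=2)(enc2)  # 64 -> 128
dec1 = BatchNormalization()(dec1)
dec1 = Activation('relu')(dec1)
dec1 = concatenate([dec1, enc1])

dec2 = Conv2DTranspose(filters=32, kernel_size=2, strides=2)(dec1)  # 128 -> 256
dec2 = BatchNormalization()(dec2)
dec2 = Activation('relu')(dec2)

# final 1x1 convolution producing the 20 binary keypoint maps
out = Conv2D(filters=20, kernel_size=1)(dec2)
out = BatchNormalization()(out)
out = Activation('sigmoid')(out)

model = Model(inputs, out)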

YOLOv4 loss too high

I am using YOLOv4-tiny for a custom dataset of 26 classes that I collected from the Open Images Dataset. The dataset is almost balanced (850 images per class, but a different number of bounding boxes). When I used YOLOv4-tiny to train on just 3 classes the loss was near 0.5 and it was fairly accurate. But for 26 classes, as soon as the loss goes below 2 the model starts to overfit. The predictions are also very inaccurate.
I have tried changing parameters like the learning rate, the momentum and the size, but whatever I do the model becomes worse than before. Using the regular YOLOv4 model rather than YOLOv4-tiny does not help either. How can I bring the loss further down?
Have you tried training with mAP? You can take a subset of your training set and make it the validation set. This can be done in the same way you made your training and test sets. Then, you can run darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map. This will keep track of the loss on your validation set. When the error on the validation set goes up, that is the time to stop training and prevent overfitting (this is called early stopping).
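For reference, the obj.data file for that setup would look roughly like this (the paths and the 26-class count are assumptions based on the question):
classes = 26
train = data/train.txt    # list of training image paths
valid = data/valid.txt    # held-out subset used for the -map evaluation
names = data/obj.names
backup = backup/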
You need to run the training for (classes * 2000) iterations. However, for the best scores, you need to train your model for at least 6000 iterations (also known as max_batches). Also, remember that if you are using black-and-white images, change channels=3 to channels=1. You can stop your training once the avg loss becomes something like 0.XXXX.
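With 26 classes, those guidelines translate to cfg values roughly like the following excerpt (illustrative, not the full file):
[net]
channels=3          # set to 1 for black-and-white images
max_batches=52000   # classes * 2000 = 26 * 2000
policy=steps
steps=41600,46800   # commonly 80% and 90% of max_batches
Remember as well that every [yolo] section needs classes=26, and the [convolutional] layer right before it needs filters=(classes + 5) * 3 = 93.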
Here's my mAP graph for 6000 iterations that ran for 6.2 hours:
avg loss with 6000 max_batches.
Moreover, you can follow this FAQ documentation here by Stéphane Charette.

Using ssd_inception_v2 to train on different resolution

The dataset contains images of different sizes.
The pretrained weights are trained on 300x300 resolution.
I am training on the WIDER FACE dataset, where objects are as small as 15x15.
Q1. I want to train at 800x800 resolution. Do I need to resize all the images manually, or will this be done by TensorFlow automatically?
I am using the following command to train:
python3 /opt/github/models/research/object_detection/legacy/train.py --logtostderr --train_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q2. I also tried training it using model_main.py, but after 1000 iterations it evaluates the dataset at every iteration.
I am using the following command to train:
python3 /opt/github/models/research/object_detection/model_main.py --num_train_steps=200000 --logtostderr --model_dir=/opt/github/object_detection_retraining/wider_face_checkpoint/ --pipeline_config_path=/opt/github/object_detection_retraining/models/ssd_inception_v2_coco_2018_01_28/pipeline.config
Q3. Also, if you can suggest any model I should use for real-time face detection apart from MobileNet and Inception, please do.
Thanks.
Q1. No, you do not need to resize manually. See this detailed answer.
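The resizing is controlled by the image_resizer block in pipeline.config; to train at 800x800 you would change it to something like the excerpt below (everything else in the config stays as it is):
model {
  ssd {
    image_resizer {
      fixed_shape_resizer {
        height: 800
        width: 800
      }
    }
  }
}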
Q2. By 1000 iterations you mean steps, right? (An iteration counts as a complete pass over the dataset.) Usually the model performs evaluation after a certain amount of time, e.g. 10 minutes. So every 10 minutes a checkpoint is saved and an evaluation of the model on the evaluation set is performed.
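If it helps, that interval comes from the Estimator configuration that model_main.py builds: tf.estimator.RunConfig saves a checkpoint every 600 seconds by default, and one evaluation pass runs per new checkpoint. A hedged sketch of lengthening it (the 1800 is an arbitrary example, edited inside model_main.py):
config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,
    save_checkpoints_secs=1800,  # default is 600 seconds between checkpoints/evaluations
)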
Q3. SSD with MobileNet is one of the fastest detectors; apart from that, you can try YOLO models for real-time detection.

when to stop training object detection tensorflow

I am training a Faster R-CNN model on a fruit dataset, using a pretrained model provided in the Google API (faster_rcnn_inception_resnet_v2_atrous_coco).
I made a few changes to the default configuration (number of classes: 12, fine_tune_checkpoint: the path to the pretrained checkpoint, and from_detection_checkpoint: true), as sketched below. The total number of annotated images I have is around 12000.
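For clarity, the edits described above correspond to pipeline.config entries along these lines (the checkpoint path is a placeholder):
model {
  faster_rcnn {
    num_classes: 12
  }
}
train_config {
  fine_tune_checkpoint: "/path/to/faster_rcnn_inception_resnet_v2_atrous_coco/model.ckpt"
  from_detection_checkpoint: true
}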
After training for 9000 steps, the results I got have an accuracy below 1 percent, though I was expecting at least 50% (in evaluation nothing is getting detected, as accuracy is almost 0). The loss fluctuates between 0 and 4.
How many steps should I train it for? I read an article which says to run around 800k steps, but is that the number of steps when you train from scratch?
The FC layers of the model are changed because of the different number of classes, but shouldn't the classes which are already present in the pre-trained model, like 'apple', be unaffected?
Any help would be much appreciated!
You shouldn't look at your training loss to determine when to stop. Instead, you should run your model through the evaluator periodically, and stop training when the evaluation mAP stops improving.
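A minimal sketch of that early-stopping rule, assuming a hypothetical run_evaluation() helper that returns the mAP for a saved checkpoint:
best_map = 0.0
patience = 5   # how many evaluations without improvement to tolerate
stale = 0
for checkpoint in checkpoints:                 # e.g. one entry per saved checkpoint
    current_map = run_evaluation(checkpoint)   # placeholder, not a real API
    if current_map > best_map:
        best_map, stale = current_map, 0
    else:
        stale += 1
    if stale >= patience:
        break  # evaluation mAP has stopped improving; stop training here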

Tensorflow-loss not decreasing when training

I am using the TensorFlow Object Detection API for my own dataset and I am facing some problems. I am using CentOS, with a GeForce 1080 GPU (8 GB GPU memory) and TensorFlow 1.2.1. I have 500 images in the training set and 40 in the test set. I did the following steps and I have two problems.
1. I annotated my images using the LabelImg tool
2. Created the tfrecord successfully
3. I used ssd_inception_v2_coco.config. I modified only the paths and the number of classes; I did not train from scratch, I used the ssd_inception_v2_coco model checkpoints.
Problem 1: from step 0 until 3000 my loss decreased dramatically, but after that it stays constant between 5 and 6. I don't understand how to reduce it further, though my model is still able to detect the required object. Here are my TensorBoard samples.
I even tried a different model, e.g. faster_rcnn_inception_resnet_v2_atrous_coco; after some steps the loss stays constant between 1 and 2.
Problem 2: according to the documentation I am able to run eval.py, but I get the following error:
WARNING:root:The following classes have no ground truth examples: 0
After that the program terminates. I tried to run train.py and eval.py at the same time, but I still get the same error.
Please give me a suggestion. I am a TensorFlow beginner and need suggestions.
The loss curve you're seeing on TensorBoard is quite normal. Initially the loss will drop very quickly, but it will seemingly "bottom out" over time. Training is a slow process; you should see a steady drop over time after more iterations.
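If it is hard to tell whether the trend is still falling, a small smoothing sketch like the one below can make the long-term direction clearer (this is the same kind of exponential moving average as TensorBoard's smoothing slider; losses is a hypothetical list of per-step loss values):
def smooth(losses, weight=0.9):
    # Exponential moving average over the raw per-step losses.
    smoothed, last = [], losses[0]
    for value in losses:
        last = weight * last + (1 - weight) * value
        smoothed.append(last)
    return smoothed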