My custom model's loss converged to a high value(Tensorflow)

My custom model's loss converged to a high value(Tensorflow) - tensorflow

I'm training my custom model that detects only person.
I used tensorflow object detection API and followed this github document.
I got images from coco dataset.
400 test images and 1600 train images are prepared.
Here is my train config.
train_config: {
batch_size: 6
optimizer {
rms_prop_optimizer: {
learning_rate: {
exponential_decay_learning_rate {
initial_learning_rate: 0.0001
decay_steps: 250
decay_factor: 0.9
}
}
momentum_optimizer_value: 0.9
decay: 0.9
epsilon: 1.0
}
}
And environment.
Tensorflow : 1.13.1 gpu
GPU : GTX 1070
CDUA : 10.0
cuDNN : 7.4.2
According to above github document, the loss should be under 2.
But my model's loss always converges to 4.
Is there any problems??
I can't figure out what is wrong...
Thanks for all help.

Related

Training loss fluctuates and does not go down effectively

I am training the SSD efficientnet B0 on Pascal Voc 2012 dataset from scratch (It needs to be from scratch) with tensorflow object detection API, i have been training for about more than 60k step with batch size of 5 (More than that and i have OOM error), however the loss fluctuates alot and really struggle to drop even abit as can seen in training loss graph, has my training gone wrong or is it the optimizer that i set is not suitable? How would i solve it?
The optimizer configurations is as below:
optimizer {
adam_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: .01
schedule {
step: 20000
learning_rate: .001
}
schedule {
step: 40000
learning_rate: .0001
}
schedule {
step: 60000
learning_rate: .00001
}
}
}
Please help, thanks.

Which losses scalar shoud I watch in tensorboard?

I'm training my custom object detection model that only detects person.
I followed this github document.
Here is my environment.
Tensorflow version : 1.13.1 GPU
Tensorboard version : 1.13.1
Pre-trained model : SSD mobilenet v2 quantized 300x300 coco
According to above document, my model's loss should drop down under 2.
So I opened tensorboard and found classification losses, total losses, clone losses scalars.
Here is my tensorboard snapshot.
Which losses scalars should I look up?
I also have problem with converging.
My model's loss converged at 4, not 2.
How can I fix it?
I reduced the learning rate 0.004(provided value) to 0.0000095.
But it still converged at 4.
Here is my train_config.
train_config: {
batch_size: 6
optimizer {
rms_prop_optimizer: {
learning_rate: {
exponential_decay_learning_rate {
initial_learning_rate: 0.0000095
decay_steps: 500
decay_factor: 0.95
}
}
momentum_optimizer_value: 0.9
decay: 0.9
epsilon: 1.0
}
Is there anything to change?
Thanks for all help.

model_main.py faster-rcnn CUDA_ERROR_OUT_OF_MEMORY

Description:
I am able to train faster-rcnn model with legacy/train.py, but it runs into problem as below when I try to use model_main.py to train with the same config setting.
Image resolution: 1920x1080
tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
.\tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256): Total Chunks: 4753, Chunks in use: 4753. 1.16MiB allocated for chunks. 1.16MiB in use in bin. 144.3KiB client-requested in use in bin.
tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0000000203800000 next 1 of size 256
What I have tried:
Set batch size to 1
use memory growing
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
or
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config, log_step_count_steps=10, save_summary_steps=20, keep_checkpoint_max=20, save_checkpoints_steps=100)
don't allocate whole of your GPU memory
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
session = tf.Session(config=config)
or
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
session_config=session_config, log_step_count_steps=10,
save_summary_steps=20, keep_checkpoint_max=20,
save_checkpoints_steps=100)
TensorFlow CUDA_ERROR_OUT_OF_MEMORY
Setting of queue_capacity, min_after_dequeue, num_readers, batch_queue_capacity, num_batch_queue_threads, prefetch_queue_capacity
Out Of Memory when training on Big Images
reduce min_dimension, max_dimension to 270, 480
None of these work for me.
Environment:
OS Platform and Distribution: Win 10 pro version: 1909
TensorFlow installed from: pip tensorflow-gpu
TensorFlow version 1.14
object-detection: 0.1 CUDA/cuDNN version: Cuda 10.0, Cudnn 10.0
GPU model and memory: NVIDIA GeForce RTX 2070 SUPER, Memory 8 G
system memory: 32G
My config:
# Faster R-CNN with Inception v2, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
faster_rcnn {
num_classes: 2
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 1080
max_dimension: 1920
}
}
feature_extractor {
type: 'faster_rcnn_inception_v2'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 1
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0002
schedule {
step: 900000
learning_rate: .00002
}
schedule {
step: 1200000
learning_rate: .000002
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: ""
from_detection_checkpoint: true
load_all_detection_checkpoint_vars: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
batch_queue_capacity: 60
num_batch_queue_threads: 30
prefetch_queue_capacity: 40
}
train_input_reader: {
tf_record_input_reader {
input_path: "D:\\object_detection\\train_data\\train.record"
}
label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
queue_capacity: 2
min_after_dequeue: 1
num_readers: 1
}
eval_config: {
metrics_set: "coco_detection_metrics"
num_examples: 1101
}
eval_input_reader: {
tf_record_input_reader {
input_path: "D:\\object_detection\\eval_data\\eval.record"
}
label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
shuffle: false
num_readers: 1
}
If there are other solutions, I will be very grateful to you.

Object detection models consume a lot of memory. This is because how they work and the large amount of anchors that they generate to find the boxes.
You are doing all fine, but your GPU is not enough for training these kind of models.
Things you can do:
Reduce the image size, lets say something like 720x512
Use SGD as optimizer, instead of others optimizers such Adam. SGD consumes approximately 3 times less memory than Adam.
Also is worth to mention that you are doing well with small batches of 1 instances. If I am not wrong, FasterRCNN is trained with only 2 images per batch

I just found that if I set batch_size to 3, then it works normally. When I set batch_size back to 1, it encounters OOM problem.
It is weird and I still don't know why, since it should always save memory with lower batch size.
If you encounter the same situation, can try to increase the batch size slightly, but I cannot guarantee it will work.

Transfer learning using tensorflow object detection api

I'm trying to train the model using pretrained faster_rcnn_inception_v2_coco. I'm using the following config file:
model {
faster_rcnn {
num_classes: 37
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 1080
max_dimension: 1365
}
}
feature_extractor {
type: "faster_rcnn_inception_v2"
first_stage_features_stride: 8
}
first_stage_anchor_generator {
grid_anchor_generator {
height_stride: 16
width_stride: 16
scales: 0.25
scales: 0.5
scales: 1.0
scales: 2.0
aspect_ratios: 0.5
aspect_ratios: 1.0
aspect_ratios: 2.0
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0001
}
}
initializer {
truncated_normal_initializer {
stddev: 0.00999999977648
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.699999988079
first_stage_max_proposals: 31
second_stage_batch_size: 30
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0001
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
use_dropout: true
dropout_keep_probability: 0.20
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.600000023842
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SOFTMAX
}
second_stage_classification_loss{
weighted_sigmoid_focal{
gamma:2
alpha:0.5
}
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config {
batch_size: 1
data_augmentation_options {
random_jitter_boxes {
}
}
optimizer {
adam_optimizer {
learning_rate {
manual_step_learning_rate {
initial_learning_rate: 4.99999987369e-05
schedule {
step: 160000
learning_rate: 1e-05
}
schedule {
step: 175000
learning_rate: 1e-06
}
}
}
}
use_moving_average: true
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: "/home/deploy/tensorflow/models/research/object_detection/ved/model.ckpt"
from_detection_checkpoint: true
num_steps: 400000
}
train_input_reader {
label_map_path: "/home/deploy/tensorflow/models/research/object_detection/ved/tomato.pbtxt"
tf_record_input_reader {
input_path: "/home/deploy/tensorflow/models/research/object_detection/ved/train.record"
}
}
eval_config {
num_visualizations: 4
max_evals: 5
num_examples: 4
max_num_boxes_to_visualize : 100
metrics_set: "coco_detection_metrics"
eval_interval_secs: 600
}
eval_input_reader {
label_map_path: "/home/deploy/tensorflow/models/research/object_detection/ved/tomato.pbtxt"
shuffle: true
num_epochs: 1
num_readers: 1
tf_record_input_reader {
input_path: "/home/deploy/tensorflow/models/research/object_detection/ved/val.record"
}
sample_1_of_n_examples: 2
}
But I'm getting following error:
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [148] rhs shape= [40]
[[node save/Assign_728 (defined at /home/deploy/tensorflow/models/research/object_detection/model_lib.py:490) = Assign[T=DT_FLOAT, _class=["loc:#SecondStageBoxPredictor/BoxEncodingPredictor/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageBoxPredictor/BoxEncodingPredictor/biases, save/RestoreV2/_1457)]]
[[{{node save/RestoreV2/_1768}} = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1773_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:884)]]
I don't know why it's happening. I've changed num_classes, first_stage_max_proposals and second_stage_batch_size.

Try correcting the checkpoint file path. The checkpoint file should be from the same model that is used for training. Usually, it comes with the pre trained models package downloaded from TensorFlow Model Zoo.
Try to fix in this line:
fine_tune_checkpoint: "/home/deploy/tensorflow/models/research/object_detection/ved/model.ckpt"

Hope this helps others trying to do Transfer learning using tensorflow object detection api
I found this Transfer learning with TensorFlow Hub, this link is about classification changing the code for object detection should be a nice learning curve for how every tries it out.
It basically has 3 steps
This tutorial demonstrates how to:
Use models from TensorFlow Hub with tf.keras.
Use an image classification model from TensorFlow Hub.
Do simple transfer learning to fine-tune a model for your own image classes.
You can have a look at the Downloading the model section which has a classifier as a example
You can check out pre trained object detection model that the Tensorflow Hub currently support
Here are a few good once's
Model name
Speed (ms)
COCO mAP
CenterNet HourGlass104 512x512
70
41.9
EfficientDet D2 768x768
67
41.8
EfficientDet D3 896x896
95
45.4
SSD ResNet101 V1 FPN 640x640 (RetinaNet101)
57
35.6
CenterNet Resnet101 V1 FPN 512x512
34
34.2
CenterNet Resnet50 V1 FPN 512x512
27
31.2
Step 1 and Step 2 can be completed if you follow the previous section properly.
You can use this section simple_transfer_learning
You have to got through the entire Transfer learning with TensorFlow Hub to understand more

Training loss fluctuates a lot

I am fine tuning SSD Mobilenet (COCO) on Pascal VOC dataset. I have around 17K images in the training set and num_steps is 100000. Details of config are -
train_config: {
batch_size: 1
optimizer {
rms_prop_optimizer: {
learning_rate: {
exponential_decay_learning_rate {
initial_learning_rate: 0.0001
decay_steps: 800720
decay_factor: 0.95
}
}
momentum_optimizer_value: 0.9
decay: 0.9
epsilon: 1.0
}
However the training loss fluctuates a lot as shown here training loss
How can I avoid this ?
thanks

It seems likely that your learning rate, though you're decaying it, is
still too large in later steps.
Things that I recommend at that point would be to:
Increase your decay
Try another optimizer (e.g. ADAM which worked good for me in such cases)

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

My custom model's loss converged to a high value(Tensorflow) - tensorflow

Related

Training loss fluctuates and does not go down effectively

Which losses scalar shoud I watch in tensorboard?

model_main.py faster-rcnn CUDA_ERROR_OUT_OF_MEMORY

Transfer learning using tensorflow object detection api

Training loss fluctuates a lot

Categories

Resources