model_main.py faster-rcnn CUDA_ERROR_OUT_OF_MEMORY - tensorflow

Description:
I am able to train a faster-rcnn model with legacy/train.py, but I run into the problem below when I try to train with model_main.py using the same config settings.
Image resolution: 1920x1080
tensorflow/stream_executor/cuda/cuda_driver.cc:890] failed to alloc 8589934592 bytes on host: CUDA_ERROR_OUT_OF_MEMORY: out of memory
.\tensorflow/core/common_runtime/gpu/gpu_host_allocator.h:44] could not allocate pinned host memory of size: 8589934592
tensorflow/core/common_runtime/bfc_allocator.cc:764] Bin (256): Total Chunks: 4753, Chunks in use: 4753. 1.16MiB allocated for chunks. 1.16MiB in use in bin. 144.3KiB client-requested in use in bin.
tensorflow/core/common_runtime/bfc_allocator.cc:800] InUse at 0000000203800000 next 1 of size 256
What I have tried:
Set batch size to 1
Enable GPU memory growth:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)
or
session_config = tf.ConfigProto()
session_config.gpu_options.allow_growth = True
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, session_config=session_config, log_step_count_steps=10, save_summary_steps=20, keep_checkpoint_max=20, save_checkpoints_steps=100)
Don't let TensorFlow allocate the whole of the GPU memory:
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.6
session = tf.Session(config=config)
or
session_config = tf.ConfigProto()
session_config.gpu_options.per_process_gpu_memory_fraction = 0.6
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir,
session_config=session_config, log_step_count_steps=10,
save_summary_steps=20, keep_checkpoint_max=20,
save_checkpoints_steps=100)
From "TensorFlow CUDA_ERROR_OUT_OF_MEMORY": adjusted the settings of queue_capacity, min_after_dequeue, num_readers, batch_queue_capacity, num_batch_queue_threads and prefetch_queue_capacity
From "Out Of Memory when training on Big Images": reduced min_dimension and max_dimension to 270 and 480
None of these work for me.
Environment:
OS Platform and Distribution: Win 10 pro version: 1909
TensorFlow installed from: pip tensorflow-gpu
TensorFlow version 1.14
object-detection: 0.1
CUDA/cuDNN version: Cuda 10.0, Cudnn 10.0
GPU model and memory: NVIDIA GeForce RTX 2070 SUPER, Memory 8 G
system memory: 32G
My config:
# Faster R-CNN with Inception v2, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
faster_rcnn {
num_classes: 2
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 1080
max_dimension: 1920
}
}
feature_extractor {
type: 'faster_rcnn_inception_v2'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 1
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0002
schedule {
step: 900000
learning_rate: .00002
}
schedule {
step: 1200000
learning_rate: .000002
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: ""
from_detection_checkpoint: true
load_all_detection_checkpoint_vars: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
batch_queue_capacity: 60
num_batch_queue_threads: 30
prefetch_queue_capacity: 40
}
train_input_reader: {
tf_record_input_reader {
input_path: "D:\\object_detection\\train_data\\train.record"
}
label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
queue_capacity: 2
min_after_dequeue: 1
num_readers: 1
}
eval_config: {
metrics_set: "coco_detection_metrics"
num_examples: 1101
}
eval_input_reader: {
tf_record_input_reader {
input_path: "D:\\object_detection\\eval_data\\eval.record"
}
label_map_path: "D:\\object_detection\\pascal_label_map.pbtxt"
shuffle: false
num_readers: 1
}
If there are other solutions, I will be very grateful to you.

Object detection models consume a lot of memory. This is because of how they work and the large number of anchors they generate to find the boxes.
You are doing everything right, but your GPU does not have enough memory to train this kind of model.
Things you can do:
Reduce the image size, to something like 720x512.
Use SGD as the optimizer instead of optimizers such as Adam. SGD needs approximately 3 times less memory than Adam for weights and optimizer state, because Adam keeps two extra values per weight.
It is also worth mentioning that you are doing well with small batches of 1 instance. If I am not wrong, Faster R-CNN is trained with only 2 images per batch.
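To put a rough number on the anchor point, here is a minimal sketch that counts first-stage anchors per image. It assumes the feature map has ceil(size / stride) cells per side and takes the stride (16), the 4 scales and the 3 aspect ratios from the config in the question. The function name is made up for illustration, and the real feature-map size depends on the feature extractor's padding.

```python
import math


def anchors_per_image(height, width, stride=16, num_scales=4, num_aspect_ratios=3):
    """Approximate number of first-stage anchors generated for one image."""
    # Assumption: one grid cell per `stride` pixels, rounded up on each side.
    feat_h = math.ceil(height / stride)
    feat_w = math.ceil(width / stride)
    # Every grid cell gets one anchor per (scale, aspect ratio) pair.
    return feat_h * feat_w * num_scales * num_aspect_ratios


full_hd = anchors_per_image(1080, 1920)  # resolution from the question
reduced = anchors_per_image(512, 720)    # size suggested in this answer
print(full_hd, reduced)  # 97920 17280
```

Under these assumptions the 1920x1080 input produces more than five times as many anchors as a 720x512 one, which is part of why reducing the image size helps.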

I just found that if I set batch_size to 3, it works normally. When I set batch_size back to 1, it hits the OOM problem again.
It is weird and I still don't know why, since a lower batch size should always use less memory.
If you encounter the same situation, you can try increasing the batch size slightly, but I cannot guarantee it will work.

Related

Transfer learning using tensorflow object detection api

I'm trying to train the model using pretrained faster_rcnn_inception_v2_coco. I'm using the following config file:
model {
faster_rcnn {
num_classes: 37
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 1080
max_dimension: 1365
}
}
feature_extractor {
type: "faster_rcnn_inception_v2"
first_stage_features_stride: 8
}
first_stage_anchor_generator {
grid_anchor_generator {
height_stride: 16
width_stride: 16
scales: 0.25
scales: 0.5
scales: 1.0
scales: 2.0
aspect_ratios: 0.5
aspect_ratios: 1.0
aspect_ratios: 2.0
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0001
}
}
initializer {
truncated_normal_initializer {
stddev: 0.00999999977648
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.699999988079
first_stage_max_proposals: 31
second_stage_batch_size: 30
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0001
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
use_dropout: true
dropout_keep_probability: 0.20
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.600000023842
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SOFTMAX
}
second_stage_classification_loss{
weighted_sigmoid_focal{
gamma:2
alpha:0.5
}
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config {
batch_size: 1
data_augmentation_options {
random_jitter_boxes {
}
}
optimizer {
adam_optimizer {
learning_rate {
manual_step_learning_rate {
initial_learning_rate: 4.99999987369e-05
schedule {
step: 160000
learning_rate: 1e-05
}
schedule {
step: 175000
learning_rate: 1e-06
}
}
}
}
use_moving_average: true
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: "/home/deploy/tensorflow/models/research/object_detection/ved/model.ckpt"
from_detection_checkpoint: true
num_steps: 400000
}
train_input_reader {
label_map_path: "/home/deploy/tensorflow/models/research/object_detection/ved/tomato.pbtxt"
tf_record_input_reader {
input_path: "/home/deploy/tensorflow/models/research/object_detection/ved/train.record"
}
}
eval_config {
num_visualizations: 4
max_evals: 5
num_examples: 4
max_num_boxes_to_visualize : 100
metrics_set: "coco_detection_metrics"
eval_interval_secs: 600
}
eval_input_reader {
label_map_path: "/home/deploy/tensorflow/models/research/object_detection/ved/tomato.pbtxt"
shuffle: true
num_epochs: 1
num_readers: 1
tf_record_input_reader {
input_path: "/home/deploy/tensorflow/models/research/object_detection/ved/val.record"
}
sample_1_of_n_examples: 2
}
But I'm getting following error:
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Assign requires shapes of both tensors to match. lhs shape= [148] rhs shape= [40]
[[node save/Assign_728 (defined at /home/deploy/tensorflow/models/research/object_detection/model_lib.py:490) = Assign[T=DT_FLOAT, _class=["loc:#SecondStageBoxPredictor/BoxEncodingPredictor/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageBoxPredictor/BoxEncodingPredictor/biases, save/RestoreV2/_1457)]]
[[{{node save/RestoreV2/_1768}} = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_1773_save/RestoreV2", _device="/job:localhost/replica:0/task:0/device:CPU:0"](save/RestoreV2:884)]]
I don't know why it's happening. I've changed num_classes, first_stage_max_proposals and second_stage_batch_size.
Try correcting the checkpoint file path. The checkpoint should come from the same model that is used for training. Usually, it ships with the pre-trained model package downloaded from the TensorFlow Model Zoo.
Try fixing this line:
fine_tune_checkpoint: "/home/deploy/tensorflow/models/research/object_detection/ved/model.ckpt"
Hope this helps others trying to do transfer learning with the TensorFlow Object Detection API.
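The shapes in the error message can be read as a hint about class counts. Here is a minimal sketch, assuming the second-stage box predictor emits 4 box coordinates per class, so that the BoxEncodingPredictor bias vector has 4 * num_classes entries. The helper name is hypothetical and not part of the Object Detection API.

```python
BOX_CODE_SIZE = 4  # assumption: four box coordinates predicted per class


def classes_from_box_bias(bias_length, box_code_size=BOX_CODE_SIZE):
    """Infer the class count from the length of the box-encoding bias vector."""
    if bias_length % box_code_size != 0:
        raise ValueError("bias length is not a multiple of the box code size")
    return bias_length // box_code_size


graph_classes = classes_from_box_bias(148)      # lhs shape= [148], the current graph
checkpoint_classes = classes_from_box_bias(40)  # rhs shape= [40], the restored checkpoint
print(graph_classes, checkpoint_classes)  # 37 10
```

If that assumption holds, the graph built from the config expects 37 classes while the checkpoint being restored holds 10, so the checkpoint probably did not come from a model with the same num_classes.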
I found the tutorial "Transfer learning with TensorFlow Hub". It is about classification, but changing the code for object detection should be a nice learning exercise for whoever tries it out.
It basically has 3 steps. The tutorial demonstrates how to:
Use models from TensorFlow Hub with tf.keras.
Use an image classification model from TensorFlow Hub.
Do simple transfer learning to fine-tune a model for your own image classes.
You can have a look at the "Downloading the model" section, which uses a classifier as an example.
You can check out the pre-trained object detection models that TensorFlow Hub currently supports.
Here are a few good ones:
Model name                                     Speed (ms)   COCO mAP
CenterNet HourGlass104 512x512                 70           41.9
EfficientDet D2 768x768                        67           41.8
EfficientDet D3 896x896                        95           45.4
SSD ResNet101 V1 FPN 640x640 (RetinaNet101)    57           35.6
CenterNet Resnet101 V1 FPN 512x512             34           34.2
CenterNet Resnet50 V1 FPN 512x512              27           31.2
Step 1 and Step 2 can be completed if you follow the previous section properly.
You can use the simple_transfer_learning section.
You have to go through the entire "Transfer learning with TensorFlow Hub" tutorial to understand more.

Optimizing Faster R-CNN Inception Resnet v2 for my need

I'm using the Faster R-CNN Inception Resnet v2 model pre-trained on COCO to train my own object detector with the purpose of detecting objects from 3 classes. The objects are small compared to the size (resolution) of the image. I'm relatively new to ML and OD.
I wonder what changes I should make to the model to make it better fit my purpose. Is it a good idea to decrease the complexity of some parts of the model since I only detect 3 classes? Are there any feature extractors better suited for small objects? Is it generally best to train on a pre-trained model or should I train from scratch?
I'm aware that tuning the network to a specific need is a trial-and-error process, however, since it takes about 3 days to train the network I'm looking for some educated guesses.
Model configuration:
model {
faster_rcnn {
num_classes: 3
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 600
max_dimension: 4048
}
}
feature_extractor {
type: 'faster_rcnn_inception_resnet_v2'
first_stage_features_stride: 8
}
first_stage_anchor_generator {
# grid_anchor_generator {
# scales: [0.25, 0.5, 1.0, 2.0, 3.0]
# aspect_ratios: [0.25,0.5, 1.0, 2.0]
# height_stride: 8
# width_stride: 8
# }
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0, 3.0]
aspect_ratios: [1.0, 2.0, 3.0]
height: 64
width: 64
height_stride: 8
width_stride: 8
}
}
first_stage_atrous_rate: 2
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.01
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.4
first_stage_max_proposals: 1000
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 17
maxpool_kernel_size: 1
maxpool_stride: 1
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: True
dropout_keep_probability: 0.9
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.01
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.5
max_detections_per_class: 20
max_total_detections: 20
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 1
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.00001
schedule {
step: 100000
learning_rate: .000001
}
schedule {
step: 150000
learning_rate: .0000001
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
# PATH_TO_BE_CONFIGURED: Below line needs to match location of model checkpoint: Either use checkpoint from rcnn model, or checkpoint from previously trained model on other dataset.
fine_tune_checkpoint: "/.../model.ckpt"
from_detection_checkpoint: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
# num_steps: 200000
data_augmentation_options {
random_horizontal_flip {}
}
data_augmentation_options {
random_crop_image {
min_object_covered : 1.0
min_aspect_ratio: 0.5
max_aspect_ratio: 2
min_area: 0.2
max_area: 1.
}
}
data_augmentation_options {
random_distort_color {}
}
}
# PATH_TO_BE_CONFIGURED: Need to make sure folder structure below is correct for both train-record and label_map.pbtxt
train_input_reader: {
tf_record_input_reader {
input_path: "/.../train.record"
}
label_map_path: "/..../label_map.pbtxt"
queue_capacity: 500
min_after_dequeue: 250
}
#PATH_TO_BE_CONFIGURED: Make sure folder structure for eval_export, validation.record and label_map.pbtxt below are correct.
eval_config: {
num_examples: 30
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
max_evals: 10
num_visualizations: 30
eval_interval_secs: 600
visualization_export_dir: "/.../eval_export"
}
eval_input_reader: {
tf_record_input_reader {
input_path: "/.../test.record"
}
label_map_path: "/.../label_map.pbtxt"
shuffle: True
num_readers: 1
}

Tensorflow object detection serving

I'm using the TensorFlow Object Detection API. The problem with this API is that it exports a frozen graph for inference, and I can't use that graph for serving. So, as a workaround, I followed the tutorial here. But when I try to export the graph I get the following error:
InvalidArgumentError (see above for traceback): Restoring from
checkpoint failed. This is most likely due to a mismatch between the
current graph and the graph from the checkpoint. Please ensure that
you have not altered the graph expected based on the checkpoint.
Original error:
Assign requires shapes of both tensors to match. lhs shape= [1024,4]
rhs shape= [1024,8]
[[node save/Assign_258 (defined at
/home/deploy/models/research/object_detection/exporter.py:67) =
Assign[T=DT_FLOAT,
_class=["loc:#SecondStageBoxPredictor/BoxEncodingPredictor/weights"], use_locking=true, validate_shape=true,
_device="/job:localhost/replica:0/task:0/device:GPU:0"](SecondStageBoxPredictor/BoxEncodingPredictor/weights,
save/RestoreV2/_517)]] [[{{node save/RestoreV2/_522}} =
_SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0",
send_device="/job:localhost/replica:0/task:0/device:CPU:0",
send_device_incarnation=1, tensor_name="edge_527_save/RestoreV2",
_device="/job:localhost/replica:0/task:0/device:CPU:0"]]
The error says there is a mismatch in the graph. A possible cause might be that I'm using a pretrained graph for training which might have 4 classifications while my model has 8 (hence the shape mismatch). There is a similar issue for the deeplab model, and the solution for that specific model was to start training with the --initialize_last_layer=False and --last_layers_contain_logits_only=False parameters. But the TensorFlow Object Detection API doesn't have those parameters. So, how should I proceed? Also, is there any other way of serving models from the TensorFlow Object Detection API?
My config file looks like this:
model {
faster_rcnn {
num_classes: 1
image_resizer {
fixed_shape_resizer {
height: 1000
width: 1000
resize_method: AREA
}
}
feature_extractor {
type: "faster_rcnn_inception_v2"
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
height_stride: 16
width_stride: 16
scales: 0.25
scales: 0.5
scales: 1.0
scales: 2.0
aspect_ratios: 0.5
aspect_ratios: 1.0
aspect_ratios: 2.0
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.00999999977648
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.699999988079
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
use_dropout: false
dropout_keep_probability: 1.0
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.600000023842
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config {
batch_size: 8
data_augmentation_options {
random_horizontal_flip {
}
}
optimizer {
adam_optimizer {
learning_rate {
manual_step_learning_rate {
initial_learning_rate: 0.00010000000475
schedule {
step: 40000
learning_rate: 3.00000010611e-05
}
}
}
}
use_moving_average: true
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: "/home/deploy/models/research/object_detection/faster_rcnn_inception_v2_coco_2018_01_28/model.ckpt"
from_detection_checkpoint: true
num_steps: 60000
max_number_of_boxes: 100
}
train_input_reader {
label_map_path: "/home/deploy/models/research/object_detection/Training_carrot_060219/carrot_identify.pbtxt"
tf_record_input_reader {
input_path: "/home/deploy/models/research/object_detection/Training_carrot_060219/train.record"
}
}
eval_config {
num_visualizations: 100
num_examples: 135
eval_interval_secs: 60
use_moving_averages: false
}
eval_input_reader {
label_map_path: "/home/deploy/models/research/object_detection/Training_carrot_060219/carrot_identify.pbtxt"
shuffle: true
num_epochs: 1
num_readers: 1
tf_record_input_reader {
input_path: "/home/deploy/models/research/object_detection/Training_carrot_060219/test.record"
}
sample_1_of_n_examples: 1
}
When exporting a model for TF Serving, the config file and the checkpoint files must correspond to each other.
The problem is that, when exporting the custom-trained model, you used the old config file with the new checkpoint files.
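One way to sanity-check this before exporting is to compare num_classes in the config with the shape reported for the checkpoint. A rough sketch, assuming the box-encoding weights have shape [1024, 4 * num_classes], as the shapes in the error suggest. The function name and the regex-based parsing are illustrative only and not part of the Object Detection API.

```python
import re


def expected_box_weights_shape(config_text, fc_width=1024, box_code_size=4):
    """Shape the graph would expect for the box-encoding weights."""
    # Assumption: the first `num_classes: N` entry belongs to the model block.
    match = re.search(r"num_classes:\s*(\d+)", config_text)
    if match is None:
        raise ValueError("num_classes not found in config text")
    num_classes = int(match.group(1))
    return [fc_width, num_classes * box_code_size]


# Trimmed-down stand-in for the pipeline config shown in the question.
config_text = """
model {
  faster_rcnn {
    num_classes: 1
  }
}
"""
graph_shape = expected_box_weights_shape(config_text)
checkpoint_shape = [1024, 8]  # rhs shape reported in the error
print(graph_shape, graph_shape == checkpoint_shape)  # [1024, 4] False
```

Under that assumption, a config with num_classes: 1 does not match a checkpoint whose weights are [1024, 8], which would correspond to a model trained with 2 classes.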

retrain object_detection not trained

Background:
Windows 10
Tensorflow: 1.12
I followed the official documentation here. The dataset is generated from an experiment, so there are not many images available: about 50 training images and 10 test images. The pre-trained model is ssd_inception_v2_coco. When training with
python train.py --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_inception_v2_coco.config
I saw the following output and the program quit.
(a million lines here...)
W0423 15:59:38.764785 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/BatchNorm/beta/RMSProp] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/BatchNorm/beta/RMSProp_1] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/BatchNorm/gamma/ExponentialMovingAverage] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/BatchNorm/gamma/RMSProp] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/BatchNorm/gamma/RMSProp_1] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/weights/ExponentialMovingAverage] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/weights/RMSProp] is not available in checkpoint
W0423 15:59:38.765782 21492 variables_helper.py:144] Variable [FeatureExtractor/InceptionV2/Mixed_5c_2_Conv2d_5_3x3_s2_128/weights/RMSProp_1] is not available in checkpoint
WARNING:tensorflow:From d:\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
W0423 15:59:39.539828 21492 tf_logging.py:125] From d:\Anaconda3\lib\site-packages\tensorflow\contrib\slim\python\slim\learning.py:737: Supervisor.__init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2019-04-23 15:59:41.155297: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-04-23 15:59:41.385078: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.7085
pciBusID: 0000:01:00.0
totalMemory: 11.00GiB freeMemory: 9.11GiB
2019-04-23 15:59:41.390824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-04-23 15:59:42.311427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-04-23 15:59:42.322811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-04-23 15:59:42.324856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-04-23 15:59:42.327029: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8799 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from pre-trained-model/model.ckpt
I0423 15:59:46.439763 21492 tf_logging.py:115] Restoring parameters from pre-trained-model/model.ckpt
INFO:tensorflow:Running local_init_op.
I0423 15:59:46.674186 21492 tf_logging.py:115] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0423 15:59:47.319484 21492 tf_logging.py:115] Done running local_init_op.
INFO:tensorflow:Starting Session.
I0423 15:59:54.453117 21492 tf_logging.py:115] Starting Session.
INFO:tensorflow:Saving checkpoint to path training/model.ckpt
I0423 15:59:54.647598 15672 tf_logging.py:115] Saving checkpoint to path training/model.ckpt
INFO:tensorflow:Starting Queues.
I0423 15:59:54.651614 21492 tf_logging.py:115] Starting Queues.
INFO:tensorflow:global_step/sec: 0
I0423 16:00:01.125150 4792 tf_logging.py:159] global_step/sec: 0
D:\workspace\demo>
And here is the configure file:
model {
ssd {
num_classes: 1
box_coder {
faster_rcnn_box_coder {
y_scale: 10.0
x_scale: 10.0
height_scale: 5.0
width_scale: 5.0
}
}
matcher {
argmax_matcher {
matched_threshold: 0.5
unmatched_threshold: 0.5
ignore_thresholds: false
negatives_lower_than_unmatched: true
force_match_for_each_row: true
}
}
similarity_calculator {
iou_similarity {
}
}
anchor_generator {
ssd_anchor_generator {
num_layers: 6
min_scale: 0.2
max_scale: 0.95
aspect_ratios: 1.0
aspect_ratios: 2.0
aspect_ratios: 0.5
aspect_ratios: 3.0
aspect_ratios: 0.3333
reduce_boxes_in_lowest_layer: true
}
}
image_resizer {
fixed_shape_resizer {
height: 300
width: 300
}
}
box_predictor {
convolutional_box_predictor {
min_depth: 0
max_depth: 0
num_layers_before_predictor: 0
use_dropout: false
dropout_keep_probability: 0.8
kernel_size: 3
box_code_size: 4
apply_sigmoid_to_scores: false
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
truncated_normal_initializer {
stddev: 0.03
mean: 0.0
}
}
}
}
}
feature_extractor {
type: 'ssd_inception_v2'
min_depth: 16
depth_multiplier: 1.0
conv_hyperparams {
activation: RELU_6,
regularizer {
l2_regularizer {
weight: 0.00004
}
}
initializer {
truncated_normal_initializer {
stddev: 0.03
mean: 0.0
}
}
batch_norm {
train: true,
scale: true,
center: true,
decay: 0.9997,
epsilon: 0.001,
}
}
override_base_feature_extractor_hyperparams: true
}
loss {
classification_loss {
weighted_sigmoid {
}
}
localization_loss {
weighted_smooth_l1 {
}
}
hard_example_miner {
num_hard_examples: 3000
iou_threshold: 0.99
loss_type: CLASSIFICATION
max_negatives_per_positive: 3
min_negatives_per_image: 0
}
classification_weight: 1.0
localization_weight: 1.0
}
normalize_loss_by_num_matches: true
post_processing {
batch_non_max_suppression {
score_threshold: 1e-8
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 100
}
score_converter: SIGMOID
}
}
}
train_config: {
batch_size: 4
optimizer {
rms_prop_optimizer: {
learning_rate: {
exponential_decay_learning_rate {
initial_learning_rate: 0.0004
decay_steps: 5000
decay_factor: 0.99
}
}
momentum_optimizer_value: 0.9
decay: 0.9
epsilon: 1.0
}
}
fine_tune_checkpoint: "pre-trained-model/model.ckpt"
from_detection_checkpoint: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
data_augmentation_options {
ssd_random_crop {
}
}
}
train_input_reader: {
tf_record_input_reader {
input_path: "annotations/train.record"
}
label_map_path: "annotations/label_map.pbtxt"
}
eval_config: {
num_examples: 5
# Note: The below line limits the evaluation process to 10 evaluations.
# Remove the below line to evaluate indefinitely.
max_evals: 5
}
eval_input_reader: {
tf_record_input_reader {
input_path: "annotations/test.record"
}
label_map_path: "annotations/label_map.pbtxt"
shuffle: false
num_readers: 1
}
I guess the model is not being trained because TensorBoard looks like this:
Well, any idea how to make the training start?
Try adding --num_train_steps=10 to your command.
Well, after resizing the images to 600 * 300, things work.

Understanding peaked/curved results in mAP and Loss during object detector training

I am working on training the object detector with a custom dataset designed to detect the head of a plant. I am using the "Faster R-CNN with Resnet-101 (v1)" that was originally designed for the pet dataset.
I modified the config file to match my dataset (1875 training/375 eval) of images that are 275x550 in size. I converted all record files. The pipeline file is shown below.
I trained on a GPU overnight for 100k steps and the actual evaluation results look really good. It detects all the plant heads and the data is really useful.
The issue is the actual metrics. When checking the TensorBoard logs for the eval, all the metrics increase until 30k steps and then drop again, making a nice hump in the middle. This goes for the loss, mAP, and precision results.
Why is this happening? I assumed that if you keep training, the metrics should just flatten out to a line and not decrease again.
mAP Evaluation: https://imgur.com/a/hjobr6c
Loss Evaluation: https://imgur.com/a/EY8Afqc
# Faster R-CNN with Resnet-101 (v1) originally for Oxford-IIIT Pets Dataset. Modified for wheat head detection
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "" to find the fields that
# should be configured.
model {
faster_rcnn {
num_classes: 1
image_resizer {
keep_aspect_ratio_resizer {
min_dimension: 275
max_dimension: 550
}
}
feature_extractor {
type: 'faster_rcnn_resnet101'
first_stage_features_stride: 16
}
first_stage_anchor_generator {
grid_anchor_generator {
scales: [0.25, 0.5, 1.0, 2.0]
aspect_ratios: [0.5, 1.0, 2.0]
height_stride: 16
width_stride: 16
}
}
first_stage_box_predictor_conv_hyperparams {
op: CONV
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
truncated_normal_initializer {
stddev: 0.01
}
}
}
first_stage_nms_score_threshold: 0.0
first_stage_nms_iou_threshold: 0.7
first_stage_max_proposals: 300
first_stage_localization_loss_weight: 2.0
first_stage_objectness_loss_weight: 1.0
initial_crop_size: 14
maxpool_kernel_size: 2
maxpool_stride: 2
second_stage_box_predictor {
mask_rcnn_box_predictor {
use_dropout: false
dropout_keep_probability: 1.0
fc_hyperparams {
op: FC
regularizer {
l2_regularizer {
weight: 0.0
}
}
initializer {
variance_scaling_initializer {
factor: 1.0
uniform: true
mode: FAN_AVG
}
}
}
}
}
second_stage_post_processing {
batch_non_max_suppression {
score_threshold: 0.0
iou_threshold: 0.6
max_detections_per_class: 100
max_total_detections: 300
}
score_converter: SOFTMAX
}
second_stage_localization_loss_weight: 2.0
second_stage_classification_loss_weight: 1.0
}
}
train_config: {
batch_size: 1
optimizer {
momentum_optimizer: {
learning_rate: {
manual_step_learning_rate {
initial_learning_rate: 0.0003
schedule {
step: 900000
learning_rate: .00003
}
schedule {
step: 1200000
learning_rate: .000003
}
}
}
momentum_optimizer_value: 0.9
}
use_moving_average: false
}
gradient_clipping_by_norm: 10.0
fine_tune_checkpoint: "object_detection/faster_rcnn_resnet101_coco_11_06_2017/model.ckpt"
from_detection_checkpoint: true
load_all_detection_checkpoint_vars: true
# Note: The below line limits the training process to 200K steps, which we
# empirically found to be sufficient enough to train the pets dataset. This
# effectively bypasses the learning rate schedule (the learning rate will
# never decay). Remove the below line to train indefinitely.
num_steps: 200000
data_augmentation_options {
random_horizontal_flip {
}
}
}
train_input_reader: {
tf_record_input_reader {
input_path: "object_detection/data_wheat/train.record-?????-of-00010"
}
label_map_path: "object_detection/data_wheat/wheat_label_map.pbtxt"
}
eval_config: {
metrics_set: "coco_detection_metrics"
num_examples: 375
}
eval_input_reader: {
tf_record_input_reader {
input_path: "object_detection/data_wheat/val.record-?????-of-00010"
}
label_map_path: "object_detection/data_wheat/wheat_label_map.pbtxt"
shuffle: false
num_readers: 1
}
This is a standard case of overfitting: your model is memorizing the training data and losing its ability to generalize to unseen data.
For cases like this one you have two options:
early stopping: monitor the validation metrics and stop the training as soon as they plateau and/or start decreasing
add regularization to the model (and also do early stopping anyway)
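The early stopping option can be sketched as a simple patience rule over the evaluation history. This is only an illustration: the function name, the patience value and the mAP numbers are made up (they are not read from the plots in the question), and the metric is assumed to be one where higher is better.

```python
def early_stopping_step(eval_history, patience=3, min_delta=0.0):
    """Return (best_step, best_metric, stop_step) for a higher-is-better metric.

    stop_step is None if the patience was never exhausted.
    """
    best_step, best_metric = None, float("-inf")
    evals_without_improvement = 0
    for step, metric in eval_history:
        if metric > best_metric + min_delta:
            # New best evaluation: remember it and reset the counter.
            best_step, best_metric = step, metric
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return best_step, best_metric, step
    return best_step, best_metric, None


# Hypothetical (step, mAP) pairs with a hump around 30k steps.
history = [(10000, 0.40), (20000, 0.55), (30000, 0.62),
           (40000, 0.60), (50000, 0.57), (60000, 0.52)]
print(early_stopping_step(history, patience=2))  # (30000, 0.62, 50000)
```

With a rule like this you would keep the checkpoint saved at the best step rather than the last one.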