Tensorflow object detection mask rcnn uses too much memory - tensorflow

I am trying to run TF object detection with mask rcnn, but it keeps dying on a node with 500GB of memory.
I updated the models/research/object_detection/trainer.py ConfigProto to
session_config = tf.ConfigProto(allow_soft_placement=True,
                                intra_op_parallelism_threads=1,
                                inter_op_parallelism_threads=1,
                                device_count={'CPU': 1},
                                log_device_placement=False)
I updated the mask_rcnn_inception_resnet_v2_atrous_coco.config to
train_config: {
  batch_queue_capacity: 500
  num_batch_queue_threads: 8
  prefetch_queue_capacity: 10
Updating the ConfigProto has had the best effect so far. I got it all the way to 30 steps before it died, instead of dying at step 1. I'm reducing the values in the train_config by half for this run. I have also reduced the number of images and objects significantly.
Any other ideas?

500GB is a good amount of memory. I have had issues with running out of GPU memory, which is a separate constraint.
For TensorFlow v2, I have found the following useful:
1. Reduce batch_size to a small value
In the config file, set:
train_config: {
  batch_size: 4
  ...
}
batch_size can be as low as 1.
2. Reduce the dimensions of resized images
In the config file, set the resizer height and width to a value lower than the default of 1024x1024.
model {
  faster_rcnn {
    number_of_stages: 3
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 256
        width: 256
      }
    }
3. Don't train the Feature Detector
This only applies to Mask R-CNN, and is the most difficult change to implement. In the file research/object_detection/model_lib_v2.py, change the following code:
Current:
def eager_train_step(detection_model,
                     ...
  trainable_variables = detection_model.trainable_variables
  gradients = tape.gradient(total_loss, trainable_variables)
  if clip_gradients_value:
    gradients, _ = tf.clip_by_global_norm(gradients, clip_gradients_value)
  optimizer.apply_gradients(zip(gradients, trainable_variables))
New:
def eager_train_step(detection_model,
                     ...
  # Mask R-CNN variables to train -- not feature detector
  trainable_variables = detection_model.trainable_variables
  to_fine_tune = []
  prefixes_to_train = ['FirstStageBoxPredictor',
                       'mask_rcnn_keras_box_predictor',
                       'RPNConv']
  for var in trainable_variables:
    if any([var.name.startswith(prefix) for prefix in prefixes_to_train]):
      to_fine_tune.append(var)
  gradients = tape.gradient(total_loss, to_fine_tune)
  if clip_gradients_value:
    gradients, _ = tf.clip_by_global_norm(gradients, clip_gradients_value)
  optimizer.apply_gradients(zip(gradients, to_fine_tune))
There are implications to each of these changes. However, they may allow for a "good enough" result using scarce resources.

I had a similar issue. I managed to reduce memory consumption by another factor of 2.5x by setting the following values:
prefetch_size: 4
num_readers: 4
min_after_dequeue: 1
I am not sure which of them (maybe all?) are responsible for reducing the memory (I did not test that), or how much their exact values influence memory consumption, but you can easily try that out.
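For reference, these are input-reader fields, so they would sit inside the *_input_reader messages rather than in train_config. A minimal sketch of that placement, assuming the legacy (queue-based) input reader options shown in the next answer; the paths are placeholders:
train_input_reader: {
  tf_record_input_reader {
    input_path: "<DATASET_DIR>/tfrecords/train*.tfrecord"
  }
  label_map_path: "<DATASET_DIR>/label_map.pbtxt"
  prefetch_size: 4
  num_readers: 4
  min_after_dequeue: 1
}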

Some of the options that previously worked to reduce memory usage have been deprecated. From object_detection/protos/input_reader.proto:
optional uint32 queue_capacity = 3 [default=2000, deprecated=true];
optional uint32 min_after_dequeue = 4 [default=1000, deprecated=true];
optional uint32 prefetch_size = 13 [default = 512, deprecated=true];
optional uint32 num_parallel_map_calls = 14 [default = 64, deprecated=true];
As of today, num_parallel_batches appears to be the largest memory hog.
The *_input_reader messages in my config file now look like this:
train_input_reader: {
  tf_record_input_reader {
    input_path: "<DATASET_DIR>/tfrecords/train*.tfrecord"
  }
  label_map_path: "<DATASET_DIR>/label_map.pbtxt"
  load_instance_masks: true
  mask_type: PNG_MASKS
  num_parallel_batches: 1
}
Mask RCNN training now uses ~50% less CPU memory than before (training on 775 x 522 images).

Related

How to restore a fine-tuned model with Tensorflow 2 Object Detection API for testing?

I have successfully trained (fine-tuned) and validated an object detection model from the TensorFlow 2 Model Zoo, with this config:
...
train_input_reader: {
  label_map_path: "/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "/train.record"
  }
}
eval_config: {
  metrics_set: "coco_detection_metrics" #coco_detection_metrics
  use_moving_averages: false
  batch_size: 1;
}
eval_input_reader: {
  label_map_path: "/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "/validation.record"
  }
}
...
Then, by analyzing the performance on TensorBoard, I noticed that the best model based on the eval loss is at step 13k, i.e. ckpt-14.
However, I also have /test.record, on which I want to test the model based on ckpt-14. What could I do? I tried creating a separate folder with ckpt-14.index, ckpt-14.data-..., and a file named "checkpoint" containing only ckpt-14 and its timestamp, and then launched the evaluation process with validation.record replaced by test.record in tf_record_input_reader.
Is this correct? Is there a proper way to test a model from a checkpoint with the TensorFlow 2 Object Detection API?
You can train and test the same model simultaneously. But if you have a single GPU and are training with a large dataset, it may not be possible to run testing on the same GPU, as it would result in memory errors. One good workaround is to use the same code and run the testing on the CPU. The testing cycle then takes place once every 1000 steps, and on TensorBoard you can see both the train and eval metrics, as well as the predicted bounding boxes side-by-side with the ground truth.
I will try to share the code for concurrent training and testing: it uses the GPU for training and the CPU for testing. It has been working for me and should work for you too.
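Not part of the original answer, but here is a minimal sketch of the usual way to pin the evaluation process to the CPU while training keeps the GPU, assuming the standard model_main_tf2.py entry point (passing --checkpoint_dir switches it into evaluation mode); all paths are placeholders:
import os

# Hide every GPU from this process so evaluation runs on the CPU only.
# This must be set before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import tensorflow as tf

print("Visible GPUs:", tf.config.list_physical_devices("GPU"))  # expected: []

# Training keeps running on the GPU in another shell; this process evaluates the
# checkpoints written to the model directory, e.g.:
#   python model_main_tf2.py \
#       --pipeline_config_path=<PIPELINE_CONFIG> \
#       --model_dir=<MODEL_DIR> \
#       --checkpoint_dir=<MODEL_DIR>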

Initialize the bias term of the last layer

I was reading this blog about focal loss. In the section "Focal Loss Trick" it says:
The trick Facebook AI Research used is to initialize the bias term of the last layer to some non-zero value such that the pt of positive samples is small and the pt of negative samples is large. Concretely, they set the bias term b = −log((1−π)/π). Here π is simply a variable, not the ordinary π. In their case, they set π = 0.01, therefore b ≫ wx.
I want to do the same using the TensorFlow Object Detection API. Here, the focal loss is given by the following lines in the config file:
loss {
  classification_loss {
    weighted_sigmoid_focal {
      alpha: 0.25
      gamma: 2.0
    }
  }
}
But I don't know how to set the bias term of the last layer to some non-zero value. How can I achieve this in TensorFlow?
It's given by class_prediction_bias_init in the box_predictor. So, the config file will look something like this:
box_predictor {
  weight_shared_convolutional_box_predictor {
    class_prediction_bias_init: -1.99
  }
}
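To pick a concrete value for class_prediction_bias_init, you can evaluate b = −log((1−π)/π) for your chosen prior π. A small sketch (the π values below are only illustrative examples):
import math

def focal_loss_bias_init(pi):
    # b = -log((1 - pi) / pi), so that sigmoid(b) = pi at the start of training.
    return -math.log((1.0 - pi) / pi)

print(focal_loss_bias_init(0.01))  # ~ -4.60 (the pi = 0.01 prior quoted from the blog above)
print(focal_loss_bias_init(0.12))  # ~ -1.99 (matches the value shown in the config above)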

Blocking of tf.contrib.StagingArea get() and put() operations

Work environment
TensorFlow release version : 1.3.0-rc2
TensorFlow git version : v1.3.0-rc1-994-gb93fd37
Operating System : CentOS Linux release 7.2.1511 (Core)
Problem Scenario
I am using TensorFlow StagingArea ops to increase the efficiency of my input pipeline. Here is the part of my code that constructs the input pipeline:
train_put_op_list = []
train_get_op_list = []
val_put_op_list = []
val_get_op_list = []
with tf.variable_scope(tf.get_variable_scope()) as vscope:
  for i in range(4):
    with tf.device('/gpu:%d' % i):
      with tf.name_scope('GPU-Tower-%d' % i) as scope:
        trainstagingarea = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32],
                                                          shapes=[[64, 221, 221, 3], [64]],
                                                          capacity=0)
        valstagingarea = tf.contrib.staging.StagingArea(dtypes=[tf.float32, tf.int32],
                                                        shapes=[[128, 221, 221, 3], [128]],
                                                        capacity=0)
        train_put_op_list.append(trainstagingarea.put(train_iterator.get_next()))
        val_put_op_list.append(valstagingarea.put(val_iterator.get_next()))
        train_get_op_list.append(trainstagingarea.get())
        val_get_op_list.append(valstagingarea.get())
        with tf.device('/cpu:0'):
          worktype = tf.get_variable("wt", [], initializer=tf.zeros_initializer(), trainable=False)
          workcondition = tf.equal(worktype, 1)
          # elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterator.get_next())
          elem = tf.cond(workcondition, lambda: train_get_op_list[i], lambda: val_get_op_list[i])
          # This is followed by the network construction and optimizer
Now, at execution time, I first run the put() ops a couple of times and then go on to run the iterations. It is shown below:
with tf.Session(config=config) as sess:
  sess.run(init_op)
  sess.run(iterator_training_op)
  sess.run(iterator_validation_op)
  sess.run(tf.assign(worktype, 0))
  for i in range(4):
    sess.run(train_put_op_list)
    sess.run(val_put_op_list)
  writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
  epoch = 0
  iter = 0
  previous = 0
  while(epoch < 10):
    try:
      if(PROCESSINGTYPE is 'validation'):
        sess.run(val_put_op_list)
        [val_accu, summaries, numsamp] = sess.run([running_accuracy, validation_summary_op, processed])
        previous += numsamp
        print("Running Accuracy = {} : Number of sample processed = {}".format(val_accu, previous))
      else:
        sess.run(train_put_op_list)
        [loss_value, _, train_accu, summaries, batch_accu, numsamp] = sess.run([total_loss, apply_gradient_op, running_accuracy, training_summary_op, batch_accuracy, processed])
        # Remaining part of the code (not important for question)
Problem Description
The use of StagingArea improves the speed substantially (almost 3-4 times).
However, the code hangs due to some blocking. I am not sure whether the block comes from the get() or the put() operations. Here is the actual output:
# Validation is done first and the following is the output
Running Accuracy = 0.0 : Number of sample processed = 512
Running Accuracy = 0.00390625 : Number of sample processed = 1024
Running Accuracy = 0.0 : Number of sample processed = 1536
Running Accuracy = 0.001953125 : Number of sample processed = 2048
# The code hangs here
You can notice that at the beginning of with tf.Session(config=config) as sess:, the put() ops were run 4 times. The output is limited to 4 lines as well. This means that
sess.run(val_put_op_list) within the while loop does not do anything. So, when get() is called by sess.run([running_accuracy, ...]), the StagingArea is found empty after 4 lines, and hence the blocking happens.
Am I correct in my analysis of the problem?
What is the correct way to use the get() and put() ops here?
If the StagingArea is full and put() blocks, would that also block the whole code? The TensorFlow documentation does not say anything about it.
Take a look at https://github.com/tensorflow/tensorflow/pull/13684. This resolves some deadlocks and will likely go into 1.4.0. Disclaimer: I am not a tensorflower.
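Not from the linked PR, but for reference: a common StagingArea pattern is to group the put() with the main step op, so the buffer is refilled on every iteration instead of only during the warm-up loop. A minimal, self-contained sketch of that idea (TF 1.x graph mode, with a toy producer and a toy step op standing in for the question's model):
import tensorflow as tf  # TensorFlow 1.x

# A toy producer op standing in for the question's iterator.get_next():
# each run of put_op stages one fresh [4, 8] batch.
next_batch = tf.random_uniform([4, 8])

area = tf.contrib.staging.StagingArea(dtypes=[tf.float32], shapes=[[4, 8]])
put_op = area.put([next_batch])
staged = area.get()
# get() mirrors put(); normalize in case it comes back as a one-element list.
batch = staged[0] if isinstance(staged, (list, tuple)) else staged

# A toy "training step" that consumes the staged batch.
step_op = tf.reduce_mean(batch)

with tf.Session() as sess:
    # Warm-up: pre-fill the staging area once (N times if you stage N batches ahead).
    sess.run(put_op)
    for step in range(10):
        # Run the step together with a put(), so every get() is matched by a new
        # put() and the staging area never runs dry.
        value, _ = sess.run([step_op, put_op])
        print(step, value)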

What the impact of different dimension of image resizer when using default config of object detection api

I was trying to use the TensorFlow Object Detection API to train a model.
And I was using the sample config of faster rcnn resnet101 (https://github.com/tensorflow/models/blob/master/object_detection/samples/configs/faster_rcnn_resnet101_voc07.config).
The following part of the config file was what I didn't quite understand:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
My questions were:
What is the exact meaning of min_dimension and max_dimension? Does it mean the input image will be resized to 600x1024 or 1024x600?
If I have images of different sizes, and some of them are considerably larger than 600x1024 (or 1024x600), could/should I increase the values of min_dimension and max_dimension?
The reason I had this question comes from this post:
TensorFlow Object Detection API Weird Behaviour
In this post, the author himself gave an answer to the question:
Then I decided to crop the input image and provide that as an input. Just to see if the results improve and it did!
It turns out that the dimensions of the input image were much larger than the 600 x 1024 that is accepted by the model. So, it was scaling down these images to 600 x 1024 which meant that the cigarette boxes were losing their details :)
He used the same config as I did.
And I was not sure whether I could change these parameters, or whether they were the default or recommended settings for this particular model, faster_rcnn_resnet101.
After some tests, I think I have found the answer. Please correct me if anything is wrong.
In the .config file:
image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}
According to the image resizer settings in 'object_detection/builders/image_resizer_builder.py':
if image_resizer_config.WhichOneof(
    'image_resizer_oneof') == 'keep_aspect_ratio_resizer':
  keep_aspect_ratio_config = image_resizer_config.keep_aspect_ratio_resizer
  if not (keep_aspect_ratio_config.min_dimension
          <= keep_aspect_ratio_config.max_dimension):
    raise ValueError('min_dimension > max_dimension')
  return functools.partial(
      preprocessor.resize_to_range,
      min_dimension=keep_aspect_ratio_config.min_dimension,
      max_dimension=keep_aspect_ratio_config.max_dimension)
Then it uses the 'resize_to_range' function from 'object_detection/core/preprocessor.py':
with tf.name_scope('ResizeToRange', values=[image, min_dimension]):
  image_shape = tf.shape(image)
  orig_height = tf.to_float(image_shape[0])
  orig_width = tf.to_float(image_shape[1])
  orig_min_dim = tf.minimum(orig_height, orig_width)
  # Calculates the larger of the possible sizes
  min_dimension = tf.constant(min_dimension, dtype=tf.float32)
  large_scale_factor = min_dimension / orig_min_dim
  # Scaling orig_(height|width) by large_scale_factor will make the smaller
  # dimension equal to min_dimension, save for floating point rounding errors.
  # For reasonably-sized images, taking the nearest integer will reliably
  # eliminate this error.
  large_height = tf.to_int32(tf.round(orig_height * large_scale_factor))
  large_width = tf.to_int32(tf.round(orig_width * large_scale_factor))
  large_size = tf.stack([large_height, large_width])
  if max_dimension:
    # Calculates the smaller of the possible sizes, use that if the larger
    # is too big.
    orig_max_dim = tf.maximum(orig_height, orig_width)
    max_dimension = tf.constant(max_dimension, dtype=tf.float32)
    small_scale_factor = max_dimension / orig_max_dim
    # Scaling orig_(height|width) by small_scale_factor will make the larger
    # dimension equal to max_dimension, save for floating point rounding
    # errors. For reasonably-sized images, taking the nearest integer will
    # reliably eliminate this error.
    small_height = tf.to_int32(tf.round(orig_height * small_scale_factor))
    small_width = tf.to_int32(tf.round(orig_width * small_scale_factor))
    small_size = tf.stack([small_height, small_width])
    new_size = tf.cond(
        tf.to_float(tf.reduce_max(large_size)) > max_dimension,
        lambda: small_size, lambda: large_size)
  else:
    new_size = large_size
  new_image = tf.image.resize_images(image, new_size,
                                     align_corners=align_corners)
From the above code we can see that if we have an image of size 800*1000, the final output image will be of size 600*750.
That is, this image resizer will always resize your input image according to the 'min_dimension' and 'max_dimension' settings, while preserving the aspect ratio.
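To make the arithmetic concrete, the same sizing rule can be written with plain Python numbers (a re-implementation for illustration, not the API's own code):
def keep_aspect_ratio_size(height, width, min_dimension=600, max_dimension=1024):
    # Scale so the short side reaches min_dimension, unless that would push the
    # long side past max_dimension; in that case scale so the long side reaches
    # max_dimension instead.
    large_scale = min_dimension / min(height, width)
    large_size = (round(height * large_scale), round(width * large_scale))
    if max(large_size) > max_dimension:
        small_scale = max_dimension / max(height, width)
        return (round(height * small_scale), round(width * small_scale))
    return large_size

print(keep_aspect_ratio_size(800, 1000))  # (600, 750) -- the example above
print(keep_aspect_ratio_size(600, 3000))  # (205, 1024) -- long side capped at max_dimension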

tensorflow with multi-gpu and tf.RandomShuffleQueue

I am trying to modify the Mask R-CNN code to run it on multiple GPUs, based on the CIFAR-10 multi-GPU example. Most of the relevant code is below.
One image and its ground-truth information are read from a TFRecords file as below:
image, ih, iw, gt_boxes, gt_masks, num_instances, img_id = \
    datasets.get_dataset(FLAGS.dataset_name,
                         FLAGS.dataset_split_name,
                         FLAGS.dataset_dir,
                         FLAGS.im_batch,
                         is_training=True)
Here the image size and num_instances differ among images. These inputs are then stored in a RandomShuffleQueue as below:
data_queue = tf.RandomShuffleQueue(capacity=32, min_after_dequeue=16,
                                   dtypes=(
                                       image.dtype, ih.dtype, iw.dtype,
                                       gt_boxes.dtype, gt_masks.dtype,
                                       num_instances.dtype, img_id.dtype))
enqueue_op = data_queue.enqueue((image, ih, iw, gt_boxes, gt_masks, num_instances, img_id))
data_queue_runner = tf.train.QueueRunner(data_queue, [enqueue_op] * 4)
tf.add_to_collection(tf.GraphKeys.QUEUE_RUNNERS, data_queue_runner)
Then I use tower_grads to gather the gradients from each GPU and average them. Below is the code for multi-GPU:
tower_grads = []
num_gpus = 2
with tf.variable_scope(tf.get_variable_scope()):
  for i in xrange(num_gpus):
    with tf.device('/gpu:%d' % i):
      with tf.name_scope('tower_%d' % i) as scope:
        (image, ih, iw, gt_boxes, gt_masks, num_instances, img_id) = data_queue.dequeue()
        im_shape = tf.shape(image)
        image = tf.reshape(image, (im_shape[0], im_shape[1], im_shape[2], 3))
        total_loss = compute_loss()  # use tensors from the dequeue operation to compute the loss
        grads = compute_grads(total_loss)
        tower_grads.append(grads)
grads = average_grads(tower_grads)
When num_gpus=1 the code works well (I mean there is no error), but when I use two TITAN X GPUs there are some strange errors:
failed to enqueue async memset operation: CUDA_ERROR_INVALID_HANDLE
Internal: Blas GEMM launch failed
and the errors are not the same when you run the code several times. I can't figure out why these errors occur with multiple GPUs. Is there some conflict on the data queue or on the GPUs?