How to modify ssd mobilenet config to detect small objects using tensorflow object detection API? - tensorflow

I am trying to detect small objects in IP camera video streams using ssd mobilenetv2. The model was trained on high-resolution images of these small objects, where the objects are very close to the camera. The images were downloaded from the internet.
I found that changing the anchor box scales and modifying the feature extractor are the commonly proposed solutions to overcome this.
Can anyone guide me on how to do this?

mobilenet-ssd is great for large objects, yet its performance on small objects is pretty poor.
It is always better to train with anchors tuned to the aspect ratios and sizes of the objects you expect.
One more thing to take into account is that the first branch is the one that detects the smallest objects; the resolution of this branch is 1/16 of the input. You should consider adding another branch at the 1/8 feature map, which will help with small objects.
How to change anchor sizes and aspect ratios:
Let us take as an example the pipeline.config file that is used for the training configuration - https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_coco.config.
There you will find the following arguments:
anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.20000000298
    max_scale: 0.949999988079
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 3.0
    aspect_ratios: 0.333299994469
  }
}
num_layers - the number of branches; the first branch starts at 1/16 of the input...
min_scale / max_scale - min_scale corresponds to the scale of the anchors in the first branch, max_scale to the scale of the last branch, and every branch in between gets its scale by linear interpolation:
scale_k = min_scale + (max_scale - min_scale) / (num_layers - 1) * k, where k is the branch index (the same scheme as defined in SSD: Single Shot MultiBox Detector - https://arxiv.org/pdf/1512.02325.pdf).
aspect_ratios - the list of aspect ratios that defines the anchors; this way you can decide which anchor shapes to add. AR=1.0 means a square anchor, 2.0 means a landscape anchor whose width is twice its height, and 0.5 means a portrait anchor whose height is twice its width...
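To make those two fields concrete, here is a minimal sketch (plain NumPy, not the API's own code, and assuming a 300x300 input purely for illustration) that reproduces the scale interpolation and the rough anchor sizes it implies - lowering min_scale is what shrinks the first-branch anchors for small objects:
import numpy as np

num_layers = 6
min_scale, max_scale = 0.2, 0.95
aspect_ratios = [1.0, 2.0, 0.5, 3.0, 0.3333]
input_size = 300  # assumed input resolution, for illustration only

# Linear interpolation of scales across the branches, as in the SSD paper.
scales = [min_scale + (max_scale - min_scale) / (num_layers - 1) * k
          for k in range(num_layers)]
print(scales)  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

# Approximate anchor width/height (in pixels) for the first branch, per aspect ratio.
for ar in aspect_ratios:
    w = scales[0] * np.sqrt(ar) * input_size
    h = scales[0] / np.sqrt(ar) * input_size
    print("AR %.4f -> %.0f x %.0f px" % (ar, w, h))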
The code can be found at the following paths:
https://github.com/tensorflow/models/blob/master/research/object_detection/anchor_generators/grid_anchor_generator.py
and
https://github.com/tensorflow/models/blob/master/research/object_detection/anchor_generators/multiscale_grid_anchor_generator.py
One more thing is that in mobilenet-v1-ssd the first branch has only 3 anchors. I'm not sure how many mobilenet-v2-ssd has, but you may want to add more anchors. You will need to change this in the code (in multiple_grid_anchor_generator.py):
if layer == 0 and reduce_boxes_in_lowest_layer:
  layer_box_specs = [(0.1, 1.0), (scale, 2.0), (scale, 0.5)]
As you can see, it is hard-coded to be three anchors...
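As a hedged sketch of what such an edit could look like - the extra (scale, aspect_ratio) pair is purely illustrative and the values are not tuned:
if layer == 0 and reduce_boxes_in_lowest_layer:
    layer_box_specs = [(0.05, 1.0),   # added: an even smaller square anchor (illustrative)
                       (0.1, 1.0),
                       (scale, 2.0),
                       (scale, 0.5)]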
How to start the branches earlier
This also needs to be changed inside the code. Each predefined model has its own model file, e.g. ssd_mobilenet_v2:
https://github.com/tensorflow/models/blob/master/research/object_detection/models/ssd_mobilenet_v2_feature_extractor.py
lines 111-117:
feature_map_layout = {
    'from_layer': ['layer_15/expansion_output', 'layer_19', '', '', '', ''
                  ][:self._num_layers],
    'layer_depth': [-1, -1, 512, 256, 256, 128][:self._num_layers],
    'use_depthwise': self._use_depthwise,
    'use_explicit_padding': self._use_explicit_padding,
}
You can choose which layers to start from by their names.
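As a hedged sketch, one way to add an earlier (1/8) branch is to prepend a stride-8 endpoint to 'from_layer' and a matching entry to 'layer_depth'. The endpoint name 'layer_8/expansion_output' below is an assumption used for illustration - inspect the endpoint names your MobileNetV2 feature extractor actually exposes before relying on it:
feature_map_layout = {
    # 'layer_8/expansion_output' is a hypothetical name for a 1/8-resolution endpoint.
    'from_layer': ['layer_8/expansion_output',
                   'layer_15/expansion_output', 'layer_19', '', '', '', ''
                  ][:self._num_layers],
    'layer_depth': [-1, -1, -1, 512, 256, 256, 128][:self._num_layers],
    'use_depthwise': self._use_depthwise,
    'use_explicit_padding': self._use_explicit_padding,
}
You would also need to raise num_layers (and the anchor generator's num_layers) accordingly, or drop one of the later branches, so that the lists and the anchor configuration stay consistent.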
Now for my 2 cents: I didn't try mobilenet-v2-ssd and mainly used mobilenet-v1-ssd, but from my experience it is not a good model for small objects. I guess it can be optimized a little by editing the anchors, but I'm not sure that will be sufficient for your needs. For a one-stage SSD-like network, consider using ssd_mobilenet_v1_fpn_coco - it works on a 640x640 input size, and its first branch starts at 1/8 of the input size. (Cons: a bigger model and higher inference time.)

Related

Understanding 2D convolution output size

I am a beginner in convolutional deep learning. I saw the following architecture in the paper Simultaneous Feature Learning and Hash Coding with Deep Neural Networks: for images of size 256*256,
I do not understand the output size of the first 2D convolution: 96*54*54. 96 seems fine, as the number of filters is 96. But if we apply the formula for the output size: size = [(W−K+2P)/S]+1 = [(256 - 11 + 2*0)/4] + 1 = 62.25 ~ 62. I have assumed the padding P to be 0, as it is not mentioned anywhere in the paper. The Keras Conv2D API produces the same 96*62*62 output size. Then why does the paper state 96*54*54? What am I missing?
Well, it reminded me of the AlexNet paper, where there was a similar mistake. Your calculation is correct. I think they mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the input layer is:
(224-11+2*0)/4 + 1 = 54.25 ~ 54
It's highly possible that the authors mistakenly wrote 256x256 while the real architecture input size was 224x224 (as was the case in AlexNet too). The other, less likely option is that 256x256 really was the architecture's input size but they did the calculations for 224x224; I consider that such a silly mistake that I don't think it's even an option.
Thus, I believe the true input size was 224x224 instead of 256x256.
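A quick way to sanity-check both numbers is the standard output-size formula; a minimal sketch, assuming an 11x11 kernel, stride 4, and no padding as in the question:
def conv_output_size(w, k, p, s):
    # floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

print(conv_output_size(256, 11, 0, 4))  # 62 -> the 96*62*62 you computed
print(conv_output_size(224, 11, 0, 4))  # 54 -> the paper's 96*54*54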

Input feature to Feature maps

Can anybody please explain this basic thing to me: how does a 192x28x28 input get reduced to 16x28x28 feature maps using a 1x1 conv mapping? My question is about understanding what exactly happens when 192 goes to 16.
I know about ((I-F+2P)/S)+1, but what happens in the process of reducing the depth?
A single 1x1 convolution kernel compresses the whole 192*28*28 input (which can be read as 192 feature maps of 28px * 28px) into a single 1*28*28 map. So far it reduces the depth along the "feature map axis" to 1 while preserving the height and width of the original input.
But then... why do you get 16? In a convolutional layer you can have several kernels; each kernel is an independent filter of the same size. In your case your 1x1 conv layer is configured with 16 kernels, hence you get 16 maps of 28*28 (one per kernel).
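A minimal tf.keras sketch of exactly that (channels-last convention, so the 192x28x28 tensor is written as 28x28x192 here):
import tensorflow as tf

# A 1x1 convolution with 16 filters: each filter is a 1x1x192 weight vector,
# so every output pixel is a learned linear combination of the 192 input channels.
x = tf.random.normal([1, 28, 28, 192])               # one 28x28 map with 192 channels
conv1x1 = tf.keras.layers.Conv2D(filters=16, kernel_size=1)
y = conv1x1(x)
print(y.shape)  # (1, 28, 28, 16) -> depth reduced from 192 to 16, height and width preserved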

Matterport Mask R-CNN - Unpredictable loss values and strange detection results for larger images

I'm having trouble achieving viable results with Mask R-CNN and I can't seem to pinpoint why. I am using a fairly limited dataset (13 images) of large greyscale images (2560 x 2160) where the detection target is very small (mean area of 26 pixels). I have run inspect_nucleus_data.ipynb across my data and verified that the masks and images are being interpreted correctly. I've also followed the wiki guide (https://github.com/matterport/Mask_RCNN/wiki) to have my images read and dealt with as greyscale images rather than just converting them to RGB. Here is one of the images with the detection targets labelled.
During training, the loss values are pretty unpredictable, bouncing between around 1 and 2 without ever reaching a steady decline where it seems like it's converging at all. I'm using these config values at the moment; they're the best I've been able to come up with while fighting off OOM errors:
Configurations:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BATCH_SIZE 1
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE None
DETECTION_MAX_INSTANCES 450
DETECTION_MIN_CONFIDENCE 0
DETECTION_NMS_THRESHOLD 0.3
FPN_CLASSIF_FC_LAYERS_SIZE 1024
GPU_COUNT 1
GRADIENT_CLIP_NORM 5.0
IMAGES_PER_GPU 1
IMAGE_CHANNEL_COUNT 1
IMAGE_MAX_DIM 1024
IMAGE_META_SIZE 14
IMAGE_MIN_DIM 1024
IMAGE_MIN_SCALE 0
IMAGE_RESIZE_MODE square
IMAGE_SHAPE [1024 1024 1]
LEARNING_MOMENTUM 0.9
LEARNING_RATE 0.001
LOSS_WEIGHTS {'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0, 'rpn_class_loss': 1.0}
MASK_POOL_SIZE 14
MASK_SHAPE [28, 28]
MAX_GT_INSTANCES 450
MEAN_PIXEL [16.49]
MINI_MASK_SHAPE (56, 56)
NAME nucleus
NUM_CLASSES 2
POOL_SIZE 7
POST_NMS_ROIS_INFERENCE 1000
POST_NMS_ROIS_TRAINING 2000
PRE_NMS_LIMIT 6000
ROI_POSITIVE_RATIO 0.33
RPN_ANCHOR_RATIOS [0.5, 1, 2]
RPN_ANCHOR_SCALES (2, 4, 8, 16, 32)
RPN_ANCHOR_STRIDE 1
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD 0.9
RPN_TRAIN_ANCHORS_PER_IMAGE 512
STEPS_PER_EPOCH 11
TOP_DOWN_PYRAMID_SIZE 256
TRAIN_BN False
TRAIN_ROIS_PER_IMAGE 256
USE_MINI_MASK True
USE_RPN_ROIS True
VALIDATION_STEPS 1
WEIGHT_DECAY 0.0001
I'm training on all layers. The output I'm getting generally looks like this, with grid-like detections found in weird spots without ever seeming to accurately identify a nucleus. I've added the red square just to highlight a very obvious cluster of nuclei that have been missed:
Here is a binary mask of these same detections so you can see their shape:
Could anyone shed some light on what might be going wrong here?
Some information is lost when you resize the training images to half their original dimensions.
Square mode also uses a lot of memory, so you might want to use crop mode instead: instead of 1024*1024 in square mode, you will have 512*512. You might encounter NaN losses due to bounding boxes falling out of range; in that case you need to make some adjustments to your data feed.
You also want to turn off the mini mask, because it will interfere with your accuracy. Using crop mode should help out a ton with memory, so you shouldn't need to worry about that.
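A minimal sketch of those two suggestions as a Matterport Mask R-CNN config override (the class attributes mirror the config dump above; the 512 crop size is the illustrative value mentioned in this answer, not a tuned setting):
from mrcnn.config import Config

class NucleusCropConfig(Config):
    NAME = "nucleus"
    NUM_CLASSES = 2              # background + nucleus, as in the question
    IMAGE_RESIZE_MODE = "crop"   # train on random crops instead of resized squares
    IMAGE_MIN_DIM = 512
    IMAGE_MAX_DIM = 512
    USE_MINI_MASK = False        # keep full-resolution masks for very small objects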

Tuning first_stage_anchor_generator in faster rcnn model

I am trying to detect some very small objects (~25x25 pixels) in large images (~2040x1536 pixels) using a Faster R-CNN model from the Object Detection API here: https://github.com/tensorflow/models/tree/master/research/object_detection
I am very confused about the following configuration parameters (I have read the proto file and also tried modifying them and testing):
first_stage_anchor_generator {
  grid_anchor_generator {
    scales: [0.25, 0.5, 1.0, 2.0]
    aspect_ratios: [0.5, 1.0, 2.0]
    height_stride: 16
    width_stride: 16
  }
}
I am quite new to this area; if someone could explain a bit about these parameters to me, it would be much appreciated.
My question is: how should I adjust the above (or other) parameters to account for the fact that I have very small, fixed-size objects to detect in large images?
Thanks
I don't know the actual answer, but I suspect that the way Faster RCNN works in Tensorflow object detection is as follows:
this article says:
"Anchors play an important role in Faster R-CNN. An anchor is a box. In the default configuration of Faster R-CNN, there are 9 anchors at a position of an image. The following graph shows 9 anchors at the position (320, 320) of an image with size (600, 800)."
The author gives an image showing overlapping boxes; those are the candidate regions that may contain the object, based on the "CNN" part of the "RCNN" model. Next comes the "R" part of the "RCNN" model, which is the region proposal. To do that, another neural network is trained alongside the CNN to figure out the best-fitting box. There are a lot of "proposals" for where an object could be, based on all the boxes, but we still don't know where it is.
This "region proposal" neural net's job is to find the correct region, and it is trained on the labels you provide with the coordinates of each object in the image.
Looking at this file, I noticed:
line 174: heights = scales / ratio_sqrts * base_anchor_size[0]
line 175: widths = scales * ratio_sqrts * base_anchor_size[1]
which seems to be the final goal of the configuration found in the config file (to generate a list of anchor boxes with known widths and heights), while base_anchor_size is created with a default of [256, 256]. In the comments, the author of the code wrote:
"For example, setting scales=[.1, .2, .2]
and aspect ratios = [2,2,1/2] means that we create three boxes: one with scale
.1, aspect ratio 2, one with scale .2, aspect ratio 2, and one with scale .2
and aspect ratio 1/2. Each box is multiplied by "base_anchor_size" before
placing it over its respective center."
This gives insight into how the boxes are created: the code builds a list of boxes from the scales=[...] and aspect_ratios=[...] parameters, which are then placed over the image. The scale is fairly straightforward: it is how much the default 256x256 square box should be scaled before it is used. The aspect ratio is what turns the original square box into a rectangle that is closer to the (scaled) shape of the objects you expect to encounter.
Meaning: to configure the scales and aspect ratios optimally, you should find the "typical" sizes of the objects in the image, whatever they are (e.g. 20 by 30, 5 by 10, etc.), figure out how much the default 256x256 square box should be scaled to fit those optimally, then find the "typical" aspect ratios of your objects (according to Google, an aspect ratio is the ratio of the width to the height of an image or screen) and set those as your aspect ratio parameters.
Note: it seems that the number of elements in the scales and aspect_ratios lists in the config file should be the same, but I don't know for sure.
Also, I am not sure how to find the optimal stride, but if your objects are smaller than 16 by 16 pixels, the sliding window you create by setting the scales and aspect ratios might just skip your objects altogether.
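To make the arithmetic concrete, here is a minimal NumPy sketch that mirrors lines 174-175 above (it simply enumerates all scale/aspect-ratio combinations; how the generator actually pairs them is left to the API). With the default base_anchor_size of 256 and a ~25x25 px target, a scale around 0.1 produces anchors of roughly the right size:
import numpy as np

base_anchor_size = np.array([256.0, 256.0])  # default [height, width]
scales = np.array([0.1, 0.25, 0.5, 1.0])     # illustrative values, not tuned
aspect_ratios = np.array([0.5, 1.0, 2.0])

for s in scales:
    for ar in aspect_ratios:
        ratio_sqrt = np.sqrt(ar)
        h = s / ratio_sqrt * base_anchor_size[0]   # as in line 174
        w = s * ratio_sqrt * base_anchor_size[1]   # as in line 175
        print("scale %.2f, AR %.1f -> %.0f x %.0f px" % (s, ar, h, w))
# e.g. scale 0.10, AR 1.0 -> 26 x 26 px, close to the ~25x25 px objects in the question.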
As far as I know, proposal anchors are generated only for Faster R-CNN model types. In this file you can see which parameters may be set for anchor generation within the lines you mentioned from the config.
I tried setting base_anchor_size; however, I failed. Though this FasterRCNNTutorial mentions that:
[...] you also need to configure the anchor sizes and aspect ratios in the .config file. The base anchor size is 255,255.
The anchor ratios will multiply the x dimension and divide the y dimension, so if you have an aspect ratio of 0.5 your 255x255 anchor becomes 128x510. Each aspect ratio in the list is applied, then the results are multiplied by the scales. So the first step is to resize your images to the training/testing size, then manually check what the smallest and largest objects you expect are, and what the most extreme aspect ratios will be. Set up the config file with values that will cover these cases when the base anchor size is adjusted by the aspect ratios and multiplied by the scales.
I think it's pretty straightforward. I also used this 'workaround'.

In Tensorflow, how to convert scores from neural net into discrete values as a part of learning process

Hello fellow tensorflowians!
I have the following schema:
I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered continuous as well as discrete (sorry, mathematicians! but it actually depends on one's training goal).
The output takes values from 0 to 1000 with an interval of 0.25 (or a precision hyperparameter), so: 0, 0.25, 0.50, ..., 100.0.
I know that it is not possible to include something like tf.to_int (I can omit fractions if necessary) or tf.round, because these are not differentiable, so we can't backpropagate through them. However, I feel there is some solution that allows the network to "know" that it is searching for a rounded solution: small fractions of integers like 0.25 or 5.75, but I don't even know where to look. I looked up quantization, but that seems to be a bit of an overkill.
So my question is:
How do I inform the graph that we don't accept values below 0.0? Would taking the abs of the network output "logits" (the regression predictions) be worth considering? If not, can I modify the loss term to severely punish scores below 0 and use absolute error instead of squared error? I may not be aware of the full consequences of doing that.
I don't care whether a prediction of 4.5 comes out as 4.49999 or 4.4, because I round predictions to the nearest .25 to compute accuracy, and that's my final model evaluation metric. If so, can I use something like the following?
precision = 0.01  # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.subtract(logits, targets)) - precision))
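A minimal, self-contained sketch of those ideas written against the current TensorFlow API (this is just the question's own proposal made runnable, not an accepted answer): clamp negative predictions with a relu, use the tolerance-band squared error above as the loss, and apply rounding only in the evaluation metric, where differentiability doesn't matter:
import tensorflow as tf

precision = 0.01  # so that sqrt(precision) == 0.1

def tolerant_loss(logits, targets):
    # Squared error, but differences smaller than sqrt(precision) cost nothing.
    squared = tf.square(tf.subtract(logits, targets))
    return tf.reduce_mean(tf.maximum(0.0, squared - precision))

def rounded_accuracy(logits, targets, step=0.25):
    # Clamp negatives to 0, round both sides to the nearest `step`, then compare.
    preds = tf.round(tf.nn.relu(logits) / step) * step
    rounded_targets = tf.round(targets / step) * step
    return tf.reduce_mean(tf.cast(tf.equal(preds, rounded_targets), tf.float32))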