Matterport Mask R-CNN - Unpredictable loss values and strange detection results for larger images - tensorflow

I'm having trouble achieving viable results with Mask R-CNN and I can't seem to pinpoint why. I am using a fairly limited dataset (13 images) of large greyscale images (2560 x 2160) where the detection target is very small (mean area of 26 pixels). I have run inspect_nucleus_data.ipynb across my data and verified that the masks and images are being interpreted correctly. I've also followed the wiki guide to have my images read and handled as greyscale images rather than just converting them to RGB. Here is one of the images with the detection targets labelled.
During training, the loss values are pretty unpredictable, bouncing between around 1 and 2 without ever settling into a steady decline that would suggest the model is converging. I'm using these config values at the moment; they're the best I've been able to come up with while fighting off OOM errors:
BACKBONE resnet101
BACKBONE_STRIDES [4, 8, 16, 32, 64]
BBOX_STD_DEV [0.1 0.1 0.2 0.2]
IMAGE_SHAPE [1024 1024 1]
LOSS_WEIGHTS {'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0, 'rpn_class_loss': 1.0}
MASK_SHAPE [28, 28]
MEAN_PIXEL [16.49]
NAME nucleus
RPN_ANCHOR_SCALES (2, 4, 8, 16, 32)
RPN_BBOX_STD_DEV [0.1 0.1 0.2 0.2]
I'm training on all layers. The output I'm getting generally looks like this, with grid-like detections found in weird spots without ever seeming to accurately identify a nucleus. I've added the red square just to highlight a very obvious cluster of nuclei that have been missed:
Here is a binary mask of these same detections so you can see their shape:
Could anyone shed some light on what might be going wrong here?

Some information is lost when you downscale the training images to roughly half their original dimensions.
Square mode also uses a lot of memory, so you might want to use crop mode instead: rather than 1024*1024 in square mode, you would train on 512*512 crops. You might encounter NaN losses caused by bounding boxes falling out of range; in that case you need to adjust your data feed.
You should also turn off mini masks, because they can hurt your accuracy. Using crop mode should help a lot with memory, so you shouldn't need to worry about that.
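As a rough illustration of the two points above, the sketch below first estimates how small the nuclei become under square-mode resizing, and then lists config overrides for crop mode. The attribute names (IMAGE_RESIZE_MODE, IMAGE_MIN_DIM, IMAGE_MAX_DIM, USE_MINI_MASK) are the ones used by the Matterport Config class. The class name CropModeOverrides and the helper scaled_area are made up for this example, and 512 is simply the crop size suggested above. In a real project these attributes would go into your own subclass of mrcnn.config.Config.

```python
def scaled_area(area_px, image_max_side, target_max_side):
    """Area of an object after the image's longer side is resized to target_max_side."""
    scale = target_max_side / image_max_side
    return area_px * scale ** 2


# 2560 x 2160 images resized so that the longer side is 1024, as in the question.
print(scaled_area(26, 2560, 1024))  # the 26 px target shrinks to roughly 4 px


class CropModeOverrides:
    """Hypothetical holder; copy these into a subclass of mrcnn.config.Config."""
    IMAGE_RESIZE_MODE = "crop"   # train on random crops instead of resized squares
    IMAGE_MIN_DIM = 512          # crop size; must be divisible by 64
    IMAGE_MAX_DIM = 512
    USE_MINI_MASK = False        # keep full-resolution masks for tiny targets


assert CropModeOverrides.IMAGE_MIN_DIM % 64 == 0
```

With a target of only about 4 pixels after resizing, even the smallest anchor scale has very little to match against, which is one plausible reason the detections look unrelated to the nuclei.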


Does the Kernel slide over each time dimension individually in Conv1D convolutions?

I am dying to understand one question that I can not find any answer to:
When doing Conv1D on a multivariate time series, is the KERNEL convolved across ALL dimensions or for each dimension individually? Is the size of the kernel [kernel_size x 1] or [kernel_size x num_dims]?
The thing is that I input an 800 by 10 time series into a Conv1D(filters=16, kernel_size=6)
and I get 800 by 16 as output, whereas I would expect to get 800 by 16 by 10, because each time series dimension is convolved with the filter individually.
What is the case?
Edit: Toy example for discussion:
We have 3 input channels, 800 time steps long. We have a kernel of 6 time steps width, meaning the effective kernel dimensions are [3,1,6].
Each time step, 6 timesteps in each channel are convolved with the kernel. Then all the kernels elements are summed.
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
When you convolve an "image" with multiple channels you sum across all the channels and then stack up filters you use to get a new "image" with (# of filters) channels. The thing that's a bit difficult for some people to understand is that the filter itself is actually (kernel_size x 1 x Number of channels). In other words your filters have depth.
So given that you're inputting this as an 800 x 1 "image" with 10 channels, you will end up with an 800 x 1 x 16 image, since you stack 16 filters. Of course the 1s aren't really important for conv1d and can be ignored, so tl;dr 800 x 10 -> 800 x 16 in this case.
Response to part 2:
We have 3 input channels, 800 time steps long. We have a kernel of 6 time steps width, meaning the effective kernel dimensions are [3,1,6].
This is essentially correct.
Each time step, 6 timesteps in each channel are convolved with the kernel. Then all the kernels elements are summed.
Yes, this is essentially correct. We end up with a slightly shorter output, because we repeat this operation at every position as we slide the kernel along the time axis: without padding there are 800 - 6 + 1 = 795 positions, giving us a 795 by 1 by 1 new image. We then repeat this operation # of filters times, and stack the results on top of each other. This is still in the third dimension, so we end up with 795 by 1 by (# of filters).
If this is correct - what is 1D about this convolution, if the image of the convolution operation is clearly 2-dimensional with [3 x 6] ?
For something to require Conv2d, it needs to have a 2nd dimension value greater than 1. For example, a color photograph might be 224 x 224 and have 3 color channels so it'd be 224 x 224 by 3.
Notably when we perform Conv2D, we also are sliding our kernel in an additional direction, for example, up and down. This is not required when you simply add more channels, since they're just added to the sum for that cell. Since we're only sliding on one axis in your example (time), we only need Conv1D.
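To make the shapes concrete, here is a minimal sketch of a 1D convolution written in plain Python (no TensorFlow needed to run it). It assumes no padding and a stride of 1. The kernel is laid out as [kernel_size][in_channels][filters], which is the same layout Keras uses for Conv1D weights; the function name conv1d and the random inputs are only for illustration.

```python
import random


def conv1d(x, kernel, bias):
    """Valid (no padding), stride-1 1D convolution.

    x:      [time][in_channels]
    kernel: [kernel_size][in_channels][filters]
    bias:   [filters]
    returns [time - kernel_size + 1][filters]
    """
    kernel_size = len(kernel)
    in_channels = len(kernel[0])
    filters = len(kernel[0][0])
    out = []
    for t in range(len(x) - kernel_size + 1):       # slide along time only
        row = []
        for f in range(filters):
            acc = bias[f]
            for k in range(kernel_size):
                for c in range(in_channels):        # channels are summed, not slid over
                    acc += x[t + k][c] * kernel[k][c][f]
            row.append(acc)
        out.append(row)
    return out


random.seed(0)
steps, channels, filters, kernel_size = 800, 10, 16, 6
x = [[random.random() for _ in range(channels)] for _ in range(steps)]
kernel = [[[random.random() for _ in range(filters)]
           for _ in range(channels)] for _ in range(kernel_size)]
bias = [0.0] * filters

y = conv1d(x, kernel, bias)
print(len(y), len(y[0]))                    # 795 16
print(kernel_size * channels * filters)     # 960 weights, plus 16 biases
```

The only loop that moves a window is the one over time; the channel loop just adds more terms to the same sum. With "same" padding the time axis would stay at 800, which matches the 800 by 16 output reported in the question.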

How to modify ssd mobilenet config to detect small objects using tensorflow object detection API?

I am trying to detect small objects from IP camera video streams using SSD MobileNetV2. The model was trained on high-resolution images of these small objects in which the objects are very close to the camera. The images were downloaded from the internet.
I found that changing the anchor box scales and modifying the feature maps are the proposed solutions to overcome this.
Can anyone guide me how to do this?
mobilenet-ssd is great for large objects, yet its performance on small objects is pretty poor.
It is always better to train with anchors tuned to the aspect ratios and sizes of the objects you expect.
One more thing to take into account is that the first branch is the one that detects the smallest objects. The resolution of this branch is 1/16 of the input, so you should consider adding another branch at the 1/8 feature map, which will help with small objects.
How to change anchors sizes and aspect ratios:
Let us take for example the pipeline.config file which is being used for the training configuration -
You will find there the following arguments:
anchor_generator {
  ssd_anchor_generator {
    num_layers: 6
    min_scale: 0.20000000298
    max_scale: 0.949999988079
    aspect_ratios: 1.0
    aspect_ratios: 2.0
    aspect_ratios: 0.5
    aspect_ratios: 3.0
    aspect_ratios: 0.333299994469
  }
}
num_layers - the number of branches; the first branch starts at 1/16 of the input resolution.
min_scale / max_scale - min_scale is the anchor scale of the first branch and max_scale is the anchor scale of the last branch. The branches in between get their scale by linear interpolation:
min_scale + (max_scale - min_scale)/(num_layers - 1) * (#branch), where #branch counts from 0 (same as defined in SSD: Single Shot MultiBox Detector).
aspect_ratios - the list of aspect ratios that define the anchors, so you can decide which anchors to add. AR=1.0 means a square anchor, 2.0 means a landscape anchor whose width is twice its height, and 0.5 means a portrait anchor whose height is twice its width.
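The interpolation above can be checked with a few lines of plain Python. This is only a sketch of the arithmetic, not the API's own code: the function names are made up, and the width/height convention (width = scale * sqrt(AR), height = scale / sqrt(AR), both relative to the input size) is my understanding of how the grid anchor generator sizes anchors, so treat it as an assumption.

```python
import math


def ssd_scales(min_scale, max_scale, num_layers):
    """Linearly interpolated anchor scale for each branch (branch index starts at 0)."""
    step = (max_scale - min_scale) / (num_layers - 1)
    return [min_scale + step * i for i in range(num_layers)]


def anchor_size(scale, aspect_ratio):
    """Relative (width, height) of an anchor, where aspect_ratio = width / height."""
    root = math.sqrt(aspect_ratio)
    return scale * root, scale / root


scales = ssd_scales(0.2, 0.95, 6)
print([round(s, 2) for s in scales])   # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

w, h = anchor_size(scales[0], 2.0)     # landscape anchor on the first branch
print(round(w, 3), round(h, 3))        # 0.283 0.141
```

Assuming a 300 x 300 input, a scale of 0.2 corresponds to a square anchor about 60 pixels wide, which is far larger than a small object. Lowering min_scale is the simplest way to shrink the first-branch anchors.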
The code can be found in the following path:
One more thing is that in mobilenet-v1-ssd the first branch has only 3 anchors. I'm not sure how many mobilenet-v2-ssd has, but you may want to add more anchors. You will need to change it in the code (in
if layer == 0 and reduce_boxes_in_lowest_layer:
  layer_box_specs = [(0.1, 1.0), (scale, 2.0), (scale, 0.5)]
As you can see, it is hard-coded to three anchors...
How to start the branches earlier
This also would be needed to be changed inside the code. Each predefined model has its own model file - i.e. ssd_mobilenet_v2:
lines 111:117
feature_map_layout = {
    'from_layer': ['layer_15/expansion_output', 'layer_19', '', '', '', ''
                  ][:self._num_layers],
    'layer_depth': [-1, -1, 512, 256, 256, 128][:self._num_layers],
    'use_depthwise': self._use_depthwise,
    'use_explicit_padding': self._use_explicit_padding,
}
You can choose what layers to start from by their name.
Now for my 2 cents: I didn't try mobilenet-v2-ssd, I mainly used mobilenet-v1-ssd, but in my experience it is not a good model for small objects. I guess it can be optimized a little bit by editing the anchors, but I'm not sure it will be sufficient for your needs. For a one-stage SSD-like network, consider using ssd_mobilenet_v1_fpn_coco: it works on a 640x640 input size, and its first branch starts at 1/8 of the input size (cons: bigger model and higher inference time).

Dynamic Tensor Alignment/Cropping

I implemented a fully convolutional network in TensorFlow. It uses an encoder-decoder structure.
When training, I always use the same image size (224x224, using random crops) and everything works nicely.
In the inference phase, I want to predict one image at a time, because I want to use the full image (not cropped). For example, such an image has size [406,256]. And here is the problem.
In the encoder-decoder architecture I add two tensors (z = x + y). When training, the sizes of both tensors match. When predicting on my single image, the sizes do not match (tensor sizes: [1,47,47,64] vs [1,46,46,64]). I think it is caused by some rounding done in the Conv and Pool layers.
What should I change in my architecture so that it works for any image size I want? Should I change the rounding parameters? Or add 'cropping' of the tensor?
Link to implementation of architecture:
(the problem occur in line 166)
I found the solution for variable input size :)
What we really needed was a 'crop layer' that crops one tensor to match the other. I found a really similar layer here:
I have just turned it into `crop_and_add` and it is working:
import tensorflow as tf

def crop_and_add(x1, x2):
    # Dynamic shapes, so this works when height and width are unknown at graph-build time.
    x1_shape = tf.shape(x1)
    x2_shape = tf.shape(x2)
    # offsets for the top left corner of the crop
    offsets = [0, (x1_shape[1] - x2_shape[1]) // 2, (x1_shape[2] - x2_shape[2]) // 2, 0]
    # -1 keeps the full batch and channel dimensions
    size = [-1, x2_shape[1], x2_shape[2], -1]
    x1_crop = tf.slice(x1, offsets, size)
    return x1_crop + x2
I replaced every addition in the model with the layer above (that is, wherever encoder and decoder data are merged).
Also, the input to the model needs to be defined as:
image = tf.placeholder(tf.float32, shape=[1, None, None, 3], name="input_image")
So we know that we will pass a single image and that the image has 3 channels, but we know neither the width nor the height. And it works very nicely! (40 FPS on a K80 on AWS P2; the image size is 224x{}, i.e. the shorter side of the image is 224)
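To see where an off-by-one mismatch like 47 vs 46 can come from, here is a small arithmetic sketch in plain Python. It assumes stride-2 layers with 'SAME' padding (output size = ceil(n / 2)) and decoder upsampling that exactly doubles the size. The real network may use different layers, so the numbers are illustrative only. The center_crop_offset helper mirrors the offset computation in crop_and_add.

```python
import math


def down(n):
    """Spatial size after a stride-2 layer with 'SAME' padding."""
    return math.ceil(n / 2)


def up(n):
    """Spatial size after an upsampling layer that doubles the resolution."""
    return 2 * n


def center_crop_offset(big, small):
    """Start index of a centered crop of length `small` inside length `big`."""
    return (big - small) // 2


for size in (224, 406):
    skip = down(down(down(size)))    # encoder feature map kept for the skip connection
    decoded = up(down(skip))         # one level deeper, then upsampled back
    offset = center_crop_offset(max(skip, decoded), min(skip, decoded))
    print(size, skip, decoded, offset)
# 224 28 28 0
# 406 51 52 0
```

The training size 224 divides evenly at every level, so both tensors agree. With 406, one intermediate size is odd, the rounding goes up, and the upsampled tensor ends up one pixel larger than the skip tensor, which is exactly the kind of difference the crop removes.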
FYI, I was also trying to run ENet (2x faster than LinkNet), but in TensorFlow it is slower. I think it is because of PReLU (which is slow in TF). It also does not support arbitrary image sizes because of the UnPool layer, which needs a predefined output size given as a list of integers (not placeholders). So LinkNet looks better in terms of speed and performance in TF.

How can I achieve better than 80% on the test set

My goal is to detect digits from 0 to 9 on a random background. I wrote a dataset generator with the following features:
Grayscale data
Random digit rotation
Random digit blur
43 different fonts
Random noisy blurred background
Here are 1024 samples of my dataset:
1024 testset samples
I adapted the mnist expert model to train the dataset and get almost 100% on the train and validation set.
On the test set I get approximately 80% correct.
Here is a sample. The green digit is the digit predicted:
9 predicted as 5
It seems that my model has some trouble distinguishing between
1 and 7
8 and 3
9 and 6
5 and 9
I need to detect the digit on any background because the test images are not always binary images.
Now my questions:
For the testset generator:
How useful is applying digit rotation? When I rotate a 7 then I get a 1 for some fonts. When I rotate a 9 I get a 6 (rotation > 90°)
Is the convolution filter already treating image rotation?
Are 180'000 image samples enough to train the model?
For the model:
Should I increase the image size from 28x28 to 56x56 when I apply a blur filter onto the dataset?
What filter size should I use?
Do I have to increase the number of hidden layers?
Thanks a lot for any guide.
If you are stuck with the different image backgrounds, I suggest you try image filtering (thresholding), which turns every image into the same binary background/foreground representation, assuming your images are of good quality.
Try this (scikit-image library):
import numpy as np
from skimage import filters as flt
filtered_image = np.array(original_image > flt.threshold_li(original_image))
Then you can use the filtered images for both training and prediction.
I ended up extracting the dataset patches out of existing images instead of using a random background with random digits. This gives us less variance and a much better accuracy on the test set.
Here is a working but not so performant implementation which allows us to define shape and stride size:
def patchify(arr, shape, stride):
    """Return a list of (patch, (row_from, row_to, col_from, col_to)) tuples."""
    patches = []
    (shape_h, shape_w) = shape
    (stride_h, stride_w) = stride
    # Count only the windows that fit entirely inside the array.
    num_patches_row = (arr.shape[0] - shape_h) // stride_h + 1
    num_patches_col = (arr.shape[1] - shape_w) // stride_w + 1
    for row in range(num_patches_row):
        row_from = row * stride_h
        row_to = row_from + shape_h
        for col in range(num_patches_col):
            col_from = col * stride_w
            col_to = col_from + shape_w
            origin_information = (row_from, row_to, col_from, col_to)
            roi = arr[row_from:row_to, col_from:col_to]
            patches.append((roi, origin_information))
    return patches
Or we can use scikit-learn, where img is a NumPy array:
from sklearn.feature_extraction.image import extract_patches_2d
patches = extract_patches_2d(img, (patch_height, patch_width))

In Tensorflow, how to convert scores from neural net into discrete values as a part of learning process

Hello fellow tensorflowians!
I have a following schema:
I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered continuous as well as discrete (sorry, mathematicians! but it depends on one's training goal actually).
Output takes values from 0 to 1000 with interval of 0.25 (or a precision hyperparameter), so : 0, 0.25, 0.50, ..., 100.0 .
I know that it is not possible to include something like tf.to_int (I can omit fractions if it's necessary) or tf.round, because these are not differentiable, so we can't backpropagate. However, I feel that there is some solution that allows network to "know" that it is searching for rounded solution: some small fractions of integers like 0.25, 5.75, but I actually don't even know where to look. I looked up quantization, but that seems to be a bit of an overkill.
So my question is:
How to inform graph that we don't accept values below 0.0 ? Would doing abs on network output "logits" (regression predictions) be something worth considering? If no, can I modify the loss term to severely punish scores below 0 and using absolute error instead of squared error? I may be not aware of full consequences of doing that
I don't care whether a prediction of 4.5 is 4.49999 or 4.4, because I round predictions to the nearest .25 to get accuracy, and that's my final model evaluation metric. If so, can I use the following?
precision = 0.01 # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.subtract(logits, targets)) - precision))
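To see what this loss does numerically, here is a minimal sketch in plain Python rather than TensorFlow, so it runs without a graph. The function names dead_zone_squared_loss and round_to_step are made up for the example, and the sample predictions are arbitrary.

```python
def dead_zone_squared_loss(predictions, targets, precision=0.01):
    """Mean of max(0, (p - t)^2 - precision): errors below sqrt(precision) cost nothing."""
    terms = [max(0.0, (p - t) ** 2 - precision) for p, t in zip(predictions, targets)]
    return sum(terms) / len(terms)


def round_to_step(value, step=0.25):
    """Round a prediction to the nearest multiple of `step` (used for evaluation only)."""
    return round(value / step) * step


preds = [4.49999, 4.45, 5.0]
targets = [4.5, 4.5, 4.5]
print(dead_zone_squared_loss(preds, targets))   # only the 5.0 prediction contributes
print([round_to_step(p) for p in preds])        # [4.5, 4.5, 5.0]
```

Inside the dead zone the loss is exactly zero, so the gradient there is zero as well: the network gets no signal to move a prediction that is already within 0.1 of the target. The rounding stays outside the loss, applied only when computing the accuracy metric, so nothing non-differentiable enters training.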