Region of interest or mask as input to CNN, for faster feedforward computation - tensorflow

My input image is huge, and only a small, irregularly shaped region should be used as input. I cannot simply crop it, because the shape of the ROI means the cropped bounding box would be almost the same size as the full image. I understand the implications for receptive field and background context, which affect accuracy. Still, I am curious: does any framework allow me to specify a binary mask as the ROI?
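For context, a minimal TensorFlow sketch of the naive workaround, assuming the mask is available as a tensor: multiply the input by the binary mask before the forward pass. This only zeroes out the background content; it does not make the convolutions any cheaper, since they still run over the full tensor.

```python
import tensorflow as tf

image = tf.random.uniform((1, 1024, 1024, 3))                            # placeholder image batch
mask = tf.cast(tf.random.uniform((1, 1024, 1024, 1)) > 0.5, tf.float32)  # placeholder binary ROI mask

# Zero everything outside the ROI; the single-channel mask broadcasts over RGB.
masked_input = image * mask
# predictions = model(masked_input)  # `model` is whatever CNN you already use
```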

Related

What kind of neural network to use for classifying a dotted image to a number depending on the number and size of dots?

I am currently trying to train a CNN to classify images that consist of dots into classes whose value depends on the number and size of the dots: more dots should map to a higher-numbered class and fewer dots to a lower-numbered class.
I wonder if there is an alternative to a CNN for this task. I started designing a CNN since it is an image problem, but then realized that these images don't really have the usual properties of object images, such as edges, that other object classification problems rely on.
The main goal is to get a number out of the network when the input is an image of this kind. I don't have a preference for how to do it, except that it must be a machine learning solution.
This is how the images look. I can use two different kinds: the original, or a binary grayscale (black and white) version.
Binary black and white image
Original image
You can convert the image to binary, where each pixel is 0 or 1; assume 0 is the background and 1 is a dot. You can then sum all the 1s in the image to get your class value, and to normalize the output you can divide it by some number.
If you want a machine learning solution, just feed that binary image to a single Dense layer; it then becomes a regression problem, not classification.
Your output activation function should be ReLU, and your loss function MSE.
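For concreteness, a minimal Keras sketch of that suggestion (the 64x64x1 input shape is an assumption, not stated in the question): flatten the binary image, regress the target with a single Dense unit, ReLU output activation, and MSE loss.

```python
import tensorflow as tf

# Minimal sketch: single Dense layer regression on flattened binary images.
# The 64x64x1 input shape is an assumption for illustration.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="relu"),   # ReLU output, as suggested above
])
model.compile(optimizer="adam", loss="mse")        # MSE loss for regression

# model.fit(x_train, y_train, epochs=10)  # x_train: binary images, y_train: numeric targets
```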

Tensorflow Object Detection: Resizing Images and Padding

I am trying to create a TensorFlow object detection model with the Single Shot Multibox Detector (SSD) with MobileNet. My dataset consists of images larger than 300x300 pixels (e.g. 1280x1080). I know that TensorFlow object detection resizes the images to 300x300 during training; what I am interested in is the following:
Does it have a positive or negative influence on the later object detection if I reduce the pictures to 300x300 pixels, with padding, before training? Without padding I don't think it has any negative effects, but with padding I'm not sure whether there are effects I'm overlooking.
Thanks a lot in advance!
I don't know SSD specifically, but CNNs generally use convolutional layers as feature extractors, stacked on one another with different kernel sizes representing different feature scales, i.e. using spatial correlation to their advantage. If you use padding, the padding will be incorporated into the extracted features, possibly corrupting your results.
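For reference, a small TensorFlow sketch of the two preprocessing options being discussed (the 1280x1080 and 300x300 sizes come from the question; everything else is illustrative):

```python
import tensorflow as tf

image = tf.random.uniform((1080, 1280, 3))       # placeholder frame, H x W x C

# Option 1: plain resize to 300x300 (distorts the aspect ratio, no padding)
resized = tf.image.resize(image, (300, 300))

# Option 2: resize while keeping the aspect ratio, padding the rest with zeros
padded = tf.image.resize_with_pad(image, 300, 300)
```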

Is it reasonable to change the input shape for a trained convolutional neural network

I've seen a number of super-resolution networks that seem to imply it's fine to train a network on inputs of shape (x, y, d) but then pass in images of arbitrary size for prediction; in Keras, for example, this is specified with the placeholder shape (None, None, 3), which will accept any size.
For example, https://github.com/krasserm/super-resolution is trained on inputs of 24x24x3 but accepts arbitrarily sized images for resizing; the demo code uses 124x118x3.
Is this a sane practice? When given a larger input, does the network simply slide a window over it, applying the same weights it learnt on the smaller image size?
Your guess is correct. Convolutional layers learn to distinguish features at the scale of their kernel, not at the scale of the image as a whole. A layer with a 3x3 kernel will learn to identify a feature up to 3x3 pixels large and will be able to identify that feature in an image whether the image is itself 3x3, 100x100, or 1080x1920.
There will be absolutely no problem with the convolutions; they will work exactly as expected, with the same weights, same kernel size, and so on.
The only possible problem is: the model may not have learned the new scale of your images (because it has never seen this scale before) and may give you poor results.
On the other hand, the model could be trained with many sizes/scales, becoming more robust to variation.
There will be a problem with Flatten, Reshape, etc.
Only GlobalMaxPooling2D and GlobalAveragePooling2D will support different sizes.
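As an illustration, a minimal Keras sketch of a fully convolutional model that accepts arbitrary input sizes (the layer choices are arbitrary): replacing Flatten with GlobalAveragePooling2D is what keeps the spatial dimensions flexible.

```python
import tensorflow as tf

# Fully convolutional model with a variable-size input; Flatten would break this,
# GlobalAveragePooling2D keeps it size-agnostic.
inputs = tf.keras.Input(shape=(None, None, 3))
x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# The same weights run on different spatial sizes:
model(tf.random.uniform((1, 24, 24, 3)))
model(tf.random.uniform((1, 124, 118, 3)))
```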

MobileNet-SSD input resolution

I have a working object detection model (a fine-tuned MobileNet SSD) that detects my custom small robot. I'll feed it webcam footage (the camera will be mounted on a drone) and use the real-time bounding box information.
So, I am about to purchase the camera.
My questions: since SSD resizes the input images to 300x300, is the camera resolution very important? Does higher resolution mean better accuracy (even when the image gets resized to 300x300 anyway)? Should I crop the camera footage to a 1:1 aspect ratio for every frame before running my object detection model on it? Should I divide the image into MxN cropped segments and run inference on them one by one?
My robot is very small and the drone will fly at an altitude of 4 meters, so I'll effectively be trying to detect a very tiny spot in the input image.
Any sort of wisdom is greatly appreciated, thank you.
These are quite a few questions; I'll try to answer all of them.
The detection model resizes the input images before feeding them to the network using some resizing method, e.g. bilinear interpolation. It is of course better if the input image is equal to or larger than the network's input size rather than smaller. A rule of thumb is that higher resolution does indeed mean better accuracy, but it depends heavily on the setup and the task.
Suppose you're trying to detect a small object and the original resolution is 1920x1080. After resizing, the small object becomes even smaller (pixel-wise) and might be too small to detect. In that case it is better to either split the image into smaller tiles (possibly with some overlap, to avoid missed detections when an object is split across tile borders) and run detection on each, or to use a model with a higher input resolution. Be aware that while the first option is possible with your current model, for the latter you'll need to train a new model, possibly with some architectural changes (e.g. adding SSD layers and modifying the anchors, depending on the scales you want to detect).
Regarding the aspect ratio, you mostly need to stay consistent. It doesn't matter if you don't keep the original aspect ratio, but if you don't, do the same thing in both training and evaluation/test/deployment.
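A rough sketch of the tiling idea, purely illustrative: split the frame into overlapping crops, run the detector on each, and shift the boxes back into full-frame coordinates. `detect_fn` is a hypothetical callable standing in for your MobileNet-SSD inference; the tile size and overlap are assumptions.

```python
import numpy as np

def tiled_detection(frame, detect_fn, tile=300, overlap=50):
    """Run `detect_fn` on overlapping crops of `frame` (H x W x 3 numpy array).

    `detect_fn` is assumed to return (y1, x1, y2, x2, score) tuples in crop
    coordinates; the tile size and overlap are illustrative values.
    """
    h, w = frame.shape[:2]
    step = tile - overlap
    detections = []
    for y in range(0, max(h - overlap, 1), step):
        for x in range(0, max(w - overlap, 1), step):
            crop = frame[y:y + tile, x:x + tile]
            for (y1, x1, y2, x2, score) in detect_fn(crop):
                # Shift boxes back into full-frame coordinates.
                detections.append((y1 + y, x1 + x, y2 + y, x2 + x, score))
    # A real pipeline would also apply non-maximum suppression across tiles
    # and handle the partial tiles at the image borders.
    return detections
```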

How do different input image sizes/resolutions affect the output quality of semantic image segmentation networks?

While trying to perform image segmentation on images from one dataset (KITTI) with a deep learning network trained on another dataset (Cityscapes), I realized that there is a big difference in the subjectively perceived quality of the output (and probably also when benchmarking the (m)IoU).
This raised the question of whether and how the size/resolution of an input image affects the output of a semantic segmentation network that has been trained on images of a different size or resolution.
I attached two images and their corresponding output images from this network: https://github.com/hellochick/PSPNet-tensorflow (using provided weights).
The first image is from the CityScapes dataset (test set) with a width and height of (2048,1024). The network has been trained with training and validation images from this dataset.
CityScapes original image
CityScapes output image
The second image is from the KITTI dataset with a width and height of (1242,375):
KITTI original image
KITTI output image
As one can see, the shapes in the first segmented image are clearly defined while in the second one a detailed separation of objects is not possible.
Neural networks in general are fairly robust to variations in scale, but they certainly aren't perfect. Although I don't have references available off the top of my head, there have been a number of papers showing that scale does indeed affect accuracy.
In fact, training your network on a dataset with images at varying scales is almost certainly going to improve it.
Also, many of the image segmentation networks used today explicitly build constructs into the network architecture to improve this.
Since you probably don't know exactly how these networks were trained, I would suggest resizing your images to match the approximate shape that the network you are using was trained on. Resizing an image using a normal image-resize function is a perfectly normal preprocessing step.
Since the images you are referencing are large, I'd also guess that whatever data input pipeline you're feeding them through is already resizing the images on your behalf. Most neural networks of this type are trained on images of around 256x256; the input image is cropped and centered as necessary before training or prediction. Processing very large images like that is extremely compute-intensive and hasn't been found to improve accuracy much.
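A small sketch of the resize-before-inference suggestion, using the shapes from the question (the segmentation model itself is assumed to be whatever trained network you already have):

```python
import tensorflow as tf

kitti = tf.random.uniform((375, 1242, 3))        # placeholder KITTI frame, H x W x C
# Resize to roughly the Cityscapes training resolution (2048 wide, 1024 high).
resized = tf.image.resize(kitti, (1024, 2048))

# logits = segmentation_model(resized[tf.newaxis, ...])  # hypothetical trained model
```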