Fully convolutional neural network for semantic segmentation - tensorflow

I have perhaps a naive question and sorry if this is not the appropriate channel to ask about these kind of questions. I have successfully implemented a FCNN for semantic segmentation, but I don't involve deconvolution or unpooling layers.
What I simply do, is to resize the ground truth image to the size of my final FCNN layer and then I compute my loss. In this way, I obtain a smaller image as output, but correctly segmented.
Is the process of deconvolution or unpooling needed at all?
I mean, resizing images in python is quite easy, so why one should involve complicated techniques as deconv or unpooling to do the same? Surely I miss something.
What's the advantage in enlarging images using unpooling and performing deconv?

The output of your network after the convolution steps is smaller than your original image: you probably don't want that, you want to have semantic segmentation for the image you give it as input.
If you simply resize it to its original size, new pixels will be interpolated and therefore lack precision. Deconvolution layers allow to learn this resize (as they're learned during training, through backpropagation), and therefore to increase your segmentation precision.

Related

Tensorflow Object Detection: Resizing Images and Padding

I am trying to create a tensorflow object detection with Single Shot Multibox Detector (SSD) with MobileNet. My dataset consists of images larger than 300x300 pixels (e.g. 1280x1080). I know that tensorflow object detection reduces the images to 300x300 in the training process, what I am interested in is the following:
Does it have a positive or negative influence on the later object detection if I reduce the pictures to 300x300 pixels before the training with padding? Without padding I don't think it has any negative effects, but with padding I'm not sure if it has any effects that I'm overlooking.
Thanks a lot in advance!
I dont know SSD, but CNNs generally use convolutional layers as feature extractors, stacked upon another with different kernel sizes representing different feature sizes, i.e. using spatial correlation to their advantage. If you use padding, the padding will thus be incorporated into the extracted features, possible corrupting your results.

Is it reasonable to change the input shape for a trained convolutional neural network

I've seen a number of super-resolution networks that seem to imply that it's fine to train a network on inputs of (x,y,d) but then pass in images of arbitrary sizes into a model for prediction, that in Keras for example is specified with the placeholder values (None,None,3) and will accept any size.
for example https://github.com/krasserm/super-resolution is trained on inputs of 24x24x3 but accepts arbitrary sized images for resize, the demo code using 124x118x3.
Is this a sane practice? Does the network when given a larger input simply slide a window over it applying the same weights as it learnt on the smaller size image?
Your guess is correct. Convolutional layers learn to distinguish features at the scale of their kernel, not at the scale of the image as a whole. A layer with a 3x3 kernel will learn to identify a feature up to 3x3 pixels large and will be able to identify that feature in an image whether the image is itself 3x3, 100x100, or 1080x1920.
There will be absolutely no problem with the convolutions, they will work exactly as they are expected to, with the same weights, same kernel size, etc. etc.
The only possible problem is: the model may not have learned the new scale of your images (because it has never seen this scale before) and may give you poor results.
On the other hand, the model could be trained with many sizes/scales, becoming more robust to variation.
There will be a problem with Flatten, Reshape, etc.
Only GlobalMaxPooling2D and GlobalAveragePooling2D will support different sizes.

Deep learning training with nonidentical images?

[![enter image description here][1]][1]I am actually reconstructing some images using dual photography. Next, I want to train a network to reconstruct clear images by removing noise (Denoising autoencoder).
The input for training the network is reconstructed images, whereas, the output is ground truth or computer based standard test images. Now the input e.g., Lena is some how not exact version of Lena with image shifted in positions and some artifacts.
If I keep input as my reconstructed image and training output as Lena test image (computer standard test image) , will it work?
I only want to know if input/output shifted or some details missing in one of them (due to some cropping) would work.
It depends on many factors like your images for training and the architecture of the network.
However, what you want to do is to make a network that learns the noise or low level information and for this purpose Generative Adversarial Networks (GAN) are very popular. You can read about them here. Maybe, after you have tried your approach and if the results are not satisfactory then try using GANs, like, DCGAN (Deep Convolution GAN).
Also, share your outcomes with the community if you would like.
Denoising Autoencoders! Love it!
There is no reason for not training your model with those images. The autoencoder, if well trained, will eventually learn the transformation if there is enough data.
However, if you have the 'positive' images, I strongly recommend you to create your own noisy images and then train in that controlled working area. You will simplify your problem and it will be easier to solve.
What is stopping you from doing just that?

How do different input image sizes/resolutions affect the output quality of semantic image segmentation networks?

While trying to perform image segmentation on images from one dataset (KITTI) with a deep learning network trained on another dataset (Cityscapes) I realized that there is a big difference in subjectively perceived quality of the output (and probably also when benchmarking the (m)IoU).
This raised my question, if and how size/resolution of an input image affects the output from a network for semantic image segmentation which has been trained on images with different size or resolution than the input image.
I attached two images and their corresponding output images from this network: https://github.com/hellochick/PSPNet-tensorflow (using provided weights).
The first image is from the CityScapes dataset (test set) with a width and height of (2048,1024). The network has been trained with training and validation images from this dataset.
CityScapes original image
CityScapes output image
The second image is from the KITTI dataset with a width and height of (1242,375):
KITTI original image
KITTI output image
As one can see, the shapes in the first segmented image are clearly defined while in the second one a detailed separation of objects is not possible.
Neural networks in general are fairly robust to variations in scale, but they certainly aren't perfect. Although I don't have references available off the top of my head there have been a number of papers that show that scale does indeed affect accuracy.
In fact training your network with a dataset with images at varying scales is almost certainly going to improve it.
Also, many of the image segmentation networks used today explicitly build constructs into the network to improve this at the level of the network architecture.
Since you probably don't know exactly how these networks were trained I would suggest that you resize your images to match the approximate shape that the network you are using was trained on. Resizing an image using normal image resize functions is quite a normal preprocessing step.
Since the images you are referencing there are large I'm also going to say that whatever data input pipeline you're feeding them through is already resizing the images on your behalf. Most neural networks of this type are trained on images of around 256x256. The input image is cropped and centered as necessary before training or prediction. Processing very large images like that is extremely compute-intensive and hasn't been found to improve the accuracy much.

What to expect from deep learning object detection on black and white pictures?

With TensorFlow, I want to train an object detection model with my own images based on ssd_inception_v2_coco model. The problem I have is that all my pictures are black and white. What performance can I expect? Should I try to colorize my B&W pictures first? Or at the opposite, should I try to retrain base network with images "uncolorized"? Are there general guidelines for B&W processing of images for deep learning object detection?
I wouldn't go through the trouble of colorizing if you are planning on using a pretrained model. I would expect that explicitly colorizing your images as a pre-processing step would help very little (if at all) since in theory the features that a colorizing network learns can also be learned by the detection network.
If you are planning on pretraining your detection network that was trained on an RGB dataset, make sure you either (i) replace the first convolution in the network with a convolutional layer that expects a single-channel input, or (ii) pad your image with two all-zero channels.
You may get slightly worse detection performance simply because you lose two thirds of the image's pixel information when using BW instead of RGB.