Is there an optimal size at which to run the object detection networks available in the Object Detection API? The API seems to accept images of all sizes, but it is unclear to me how, and to what size, the image is rescaled before being fed to the network.
Could you please clarify?
Thanks!
There is a script called preprocessor_builder which is responsible for that: whenever you feed an image to the network, it goes through this preprocessing step, which makes sure the image is resized to match what the network expects, depending on your network configuration file.
The actual resizing happens here.
The answer depends on which model you're running. For our SSD models, we resize the image to 300x300 pixels. For Faster R-CNN or R-FCN, we resize keeping the aspect ratio so that the shorter side is 600 pixels and the longer side is at most 1024 pixels.
The images the user adds to the TFRecord can be any size, but we recommend keeping them as small as possible (i.e. roughly 400-600px max per dimension for SSD, or about 1500px max per dimension for Faster R-CNN or R-FCN) for memory reasons.
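The sizes above come from the sample pipeline configs (fixed_shape_resizer for SSD, keep_aspect_ratio_resizer for Faster R-CNN / R-FCN). As a rough sketch of what those two resizers do, and not the API's own code, assuming TensorFlow 2.x and a float HWC image tensor:

```python
import tensorflow as tf

def fixed_shape_resize(image, height=300, width=300):
    """Roughly what SSD's fixed_shape_resizer does: squash to a fixed size."""
    return tf.image.resize(image, [height, width])

def keep_aspect_ratio_resize(image, min_dimension=600, max_dimension=1024):
    """Roughly what the Faster R-CNN / R-FCN keep_aspect_ratio_resizer does:
    scale so the shorter side hits min_dimension, unless that would push the
    longer side past max_dimension, in which case cap the longer side instead."""
    shape = tf.cast(tf.shape(image)[:2], tf.float32)
    shorter, longer = tf.reduce_min(shape), tf.reduce_max(shape)
    scale = tf.minimum(min_dimension / shorter, max_dimension / longer)
    new_shape = tf.cast(tf.round(shape * scale), tf.int32)
    return tf.image.resize(image, new_shape)
```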
Related
I am using the TensorFlow Object Detection API for an object detection task. However, my objects are captured from a high angle (camera at 10 m) and appear very small, while the images themselves are 1920x1080.
Questions:
1) What is the best way to detect small objects under these conditions?
2) What are the characteristics of a suitable dataset? Images from the same viewpoint (maybe)?
I appreciate all of your answers. Thanks :)
You have to consider the object detector's input size, even if you use a high-resolution image such as 1920x1080, because the detector resizes the input image to its architecture's size (e.g. YOLO commonly uses a 416x416 input).
In other words, if you feed the 1920x1080 image as-is, the API will resize it down to a small resolution like 416x416.
That means your small objects can effectively disappear while passing through the convolution filters.
In my opinion:
1) If you know where the small objects are located in the whole image, crop out those regions and feed them to the detector as separate input images (see the crop sketch below).
Even if you don't know where the small objects are, you can generate a list of candidate crops by some method.
2) I don't quite understand what you want to know; could you be more specific?
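A minimal sketch of the crop idea in point 1, assuming a NumPy HWC frame; the crop size and the (cy, cx) centre below are made-up example values, not anything prescribed by the API:

```python
import numpy as np

def crop_around(image, cy, cx, size=600):
    """Cut a fixed-size window centred on a known region of interest,
    clamped so the window stays inside the frame."""
    h, w = image.shape[:2]
    y0 = int(np.clip(cy - size // 2, 0, max(h - size, 0)))
    x0 = int(np.clip(cx - size // 2, 0, max(w - size, 0)))
    return image[y0:y0 + size, x0:x0 + size]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # stand-in for a 1920x1080 camera frame
patch = crop_around(frame, cy=200, cx=1500)        # feed this smaller patch to the detector
```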
I think you should try the "faster_rcnn_resnet101" model with the KITTI configuration, which allows a maximum image size of 1987, but this model is very slow compared to the SSD models. The configuration link is below:
https://github.com/tensorflow/models/blob/001a2a61285e378fef5f45386f638cb5e9f153c7/research/object_detection/samples/configs/faster_rcnn_resnet101_kitti.config
Also, the Faster R-CNN models do a better job than YOLO at detecting small objects; I'm not sure how the SSD models perform.
I am trying to build a CNN model using TensorFlow on my own dataset, but I have run into the problem that my pictures have different sizes. There is one kind of object in my pictures. If I make all the pictures the same size, the objects in them end up at different scales. How do I deal with this when running a CNN model in TensorFlow? I have heard from others that input data of different sizes does not matter and that using tf.reduce_max / tf.reduce_mean is the best solution. If that is true, how would I use it in my CNN model?
If I make all the pictures the same size, the objects in them end up at different scales.
If you already know how to make your input images the same size, you are ready to train your CNN model. Unless you have a strict need for the object to appear at the same size in every picture, it does not matter to the network.
The usual approach is to resize the images to a fixed size that the network accepts as input. This means distorting the aspect ratio of the objects.
If that bothers you, you could pad the images to a square (assuming the network input is square) and then resize (see the sketch below). This keeps the aspect ratio but adds some extra information (the padding).
Another option is to crop the image to a square, if you are confident you are not losing important information and your task allows it.
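For the pad-then-resize option, a minimal sketch assuming TensorFlow 2.x; tf.image.resize_with_pad does the letterboxing (resize keeping aspect ratio, then pad) in one call, and the 600x600 target here is just an example:

```python
import tensorflow as tf

image = tf.zeros([400, 700, 3])                           # stand-in for a wide input photo
letterboxed = tf.image.resize_with_pad(image, 600, 600)   # aspect ratio kept, zero padding added
squashed = tf.image.resize(image, [600, 600])             # plain resize, aspect ratio distorted
```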
The TensorFlow Object Detection API offers a variety of models. These are trained at a 600x600 image size. Suppose I have a 6000x4000 satellite image and I want to detect objects continuously throughout the image. What is the best practice for adapting a TFODI model to this image size? I don't care about the running time per image for object detection. I have a GPU with 9 GB of RAM.
I know I can fit a single 6000x4000 image onto this GPU. I'm not sure I can fit an image-processing neural net for that size onto it. I can think of a few alternatives:
Chip the image into 600x600 blocks, which risks losing features that cross block boundaries, but then everything should work out of the box.
Change the image dimensions in the model definition from 600x600 to 6000x4000. Can I retrain from the Model Zoo checkpoint, or do I have to start from scratch if I do this?
Shrink the image to a smaller size. This distorts the image dimensions and also loses feature detail. For, say, a picture of a city, the resulting detail would not be adequate to pick out cars and small houses.
You need to try different sizes and see at which size you can train without running out of memory. Memory consumption also depends on how many images you train on at once (the batch size). From what you describe, you will probably end up using an intermediate image size.
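For the chipping alternative from the question, a rough sketch assuming a NumPy HWC image and some detector callable `detect(chip)` that returns (ymin, xmin, ymax, xmax) boxes in chip pixel coordinates; the callable and the overlap value are illustrative assumptions, not part of the API:

```python
import numpy as np

def tile_starts(full, size, stride):
    """Chip start offsets that cover the full extent, including the far edge."""
    last = max(full - size, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)
    return starts

def detect_on_chips(image, detect, size=600, overlap=100):
    """Run the detector on overlapping 600x600 chips and shift the boxes back
    to full-image coordinates. Overlap reduces the chance of cutting an object
    in half; duplicates in the overlap still need non-max suppression later."""
    h, w = image.shape[:2]
    stride = size - overlap
    boxes = []
    for y0 in tile_starts(h, size, stride):
        for x0 in tile_starts(w, size, stride):
            chip = image[y0:y0 + size, x0:x0 + size]
            for ymin, xmin, ymax, xmax in detect(chip):
                boxes.append((ymin + y0, xmin + x0, ymax + y0, xmax + x0))
    return boxes
```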
While trying to perform image segmentation on images from one dataset (KITTI) with a deep learning network trained on another dataset (Cityscapes), I noticed a big difference in the subjectively perceived quality of the output (and probably also in the benchmarked (m)IoU).
This raised the question of whether, and how, the size/resolution of an input image affects the output of a semantic segmentation network that was trained on images of a different size or resolution.
I attached two images and their corresponding output images from this network: https://github.com/hellochick/PSPNet-tensorflow (using provided weights).
The first image is from the CityScapes dataset (test set) with a width and height of (2048,1024). The network has been trained with training and validation images from this dataset.
CityScapes original image
CityScapes output image
The second image is from the KITTI dataset with a width and height of (1242,375):
KITTI original image
KITTI output image
As one can see, the shapes in the first segmented image are clearly defined while in the second one a detailed separation of objects is not possible.
Neural networks in general are fairly robust to variations in scale, but they certainly aren't perfect. Although I don't have references available off the top of my head, there have been a number of papers showing that scale does indeed affect accuracy.
In fact, training your network on a dataset with images at varying scales is almost certainly going to improve it.
Also, many of the image segmentation networks used today explicitly build constructs into the network architecture to handle scale variation.
Since you probably don't know exactly how these networks were trained, I would suggest resizing your images to match the approximate shape that the network you are using was trained on. Resizing an image with normal image-resize functions is a perfectly normal preprocessing step.
Since the images you are referencing are large, I would also guess that whatever data input pipeline you're feeding them through is already resizing the images on your behalf. Most neural networks of this type are trained on images of around 256x256; the input image is cropped and centered as necessary before training or prediction. Processing very large images like that is extremely compute-intensive and hasn't been found to improve accuracy much.
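A hedged sketch of that suggestion for the KITTI/Cityscapes case above, assuming TensorFlow 2.x and some segmentation callable `segment(batch)` that returns an integer label map per image; the callable and the 1024x2048 training resolution are assumptions taken from the question, not part of the linked repository:

```python
import tensorflow as tf

def segment_at_train_resolution(image, segment, train_hw=(1024, 2048)):
    """Resize the input to the resolution the model was trained on, predict,
    then resize the label map back with nearest-neighbour so class ids
    are not blended together."""
    orig_hw = tf.shape(image)[:2]
    resized = tf.image.resize(image, train_hw)                      # bilinear for the image itself
    labels = segment(resized[tf.newaxis, ...])[0]                   # assumed to be [H, W] integer ids
    labels = tf.image.resize(labels[..., tf.newaxis], orig_hw,
                             method='nearest')                      # nearest for label maps
    return tf.squeeze(labels, -1)
```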
I am training on my own image set, using TensorFlow for Poets as an example:
https://codelabs.developers.google.com/codelabs/tensorflow-for-poets/
What size do the images need to be? I have read that the script automatically resizes the images for you, but what size does it resize them to? Can you pre-resize your images to that size to save disk space (10,000 1 MB images)?
How does it crop the images: does it chop off part of the image, add white/black bars, or change the aspect ratio?
Also, I think Inception v3 uses 299x299 images. What if your image recognition task requires more detailed accuracy? Is it possible to increase the network's image size, say to 598x598?
I don't know which resizing option this implementation uses; if you haven't found it in the documentation, then I expect we'd need to read the code.
The images can be of any size. Yes, you can shrink your images to save disk space. However, note that you lose image detail, and there is no way to recover the lost information.
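If you do decide to pre-shrink the photos to save disk space, a small sketch with Pillow; the folder names are examples, and 299x299 is only a reasonable guess given Inception v3, so check what size the retraining script itself expects first:

```python
from pathlib import Path
from PIL import Image

src, dst = Path("photos_original"), Path("photos_small")   # example folder names
dst.mkdir(exist_ok=True)
for path in src.glob("*.jpg"):
    img = Image.open(path)
    img.thumbnail((299, 299))                  # shrink in place, keeping aspect ratio
    img.save(dst / path.name, quality=90)      # write a much smaller JPEG
```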
The good news is that you shouldn't need it: CNN models are built for an image size that contains enough detail to handle the problem at hand. Greater image detail generally does not translate into greater classification accuracy, and doubling the image resolution is usually a waste of storage.
To do that, you'd have to edit the code to accept the larger "native" image size. Then you'd have to alter the model topology to account for the larger input: either a larger step-down factor somewhere (which could defeat the greater resolution), or another layer in the model to capture the larger size.
To get a more accurate model, you generally need a stronger network topology; 2x resolution does not give the network much more information to differentiate a horse from a school bus.