About MS COCO dataset Evaluation! Image Size! (question about COCO dataset.) - object-detection

I am a student currently studying object detection!
I have a question about COCO dataset.
I'm currently experimenting with COCO datasets, and there's APs APm APL in the performance evaluation metrics.
32X32 or less for APs, 32x32 to 96×96 for APm, 96×96 for APLs
It looks like this.
Is this standard for a specific image size?
Or does it mean the absolute pixel size?
If it means absolute pixel size, will the instances included in APs APm APL be different for each input size?
Thank you.

Related

Tensorflow object detection: why is the location in image affecting detection accuracy when using ssd mobilnet v1?

I'm training a model to detect meteors within a picture of the night sky and I have a fairly small dataset with about 85 images and each image is annotated with a bounding box. I'm using the transfer learning technique starting with the ssd_mobilenet_v1_coco_11_06_2017 checkpoint and Tensorflow 1.4. I'm resizing images to 600x600pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally, vertically and rotate 90 deg. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this?detected meteors in image example
It could very well be your data and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable aproximation for generic model accuracy.
At the highest possible level, the choice of model is largely a tradeoff between speed/accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would reccomend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a signifigant amount of time preprocessing images.

How do different input image sizes/resolutions affect the output quality of semantic image segmentation networks?

While trying to perform image segmentation on images from one dataset (KITTI) with a deep learning network trained on another dataset (Cityscapes) I realized that there is a big difference in subjectively perceived quality of the output (and probably also when benchmarking the (m)IoU).
This raised my question, if and how size/resolution of an input image affects the output from a network for semantic image segmentation which has been trained on images with different size or resolution than the input image.
I attached two images and their corresponding output images from this network: https://github.com/hellochick/PSPNet-tensorflow (using provided weights).
The first image is from the CityScapes dataset (test set) with a width and height of (2048,1024). The network has been trained with training and validation images from this dataset.
CityScapes original image
CityScapes output image
The second image is from the KITTI dataset with a width and height of (1242,375):
KITTI original image
KITTI output image
As one can see, the shapes in the first segmented image are clearly defined while in the second one a detailed separation of objects is not possible.
Neural networks in general are fairly robust to variations in scale, but they certainly aren't perfect. Although I don't have references available off the top of my head there have been a number of papers that show that scale does indeed affect accuracy.
In fact training your network with a dataset with images at varying scales is almost certainly going to improve it.
Also, many of the image segmentation networks used today explicitly build constructs into the network to improve this at the level of the network architecture.
Since you probably don't know exactly how these networks were trained I would suggest that you resize your images to match the approximate shape that the network you are using was trained on. Resizing an image using normal image resize functions is quite a normal preprocessing step.
Since the images you are referencing there are large I'm also going to say that whatever data input pipeline you're feeding them through is already resizing the images on your behalf. Most neural networks of this type are trained on images of around 256x256. The input image is cropped and centered as necessary before training or prediction. Processing very large images like that is extremely compute-intensive and hasn't been found to improve the accuracy much.

SSD mobilenet trained model with custom data only recognize images in short distances

I've trained a model with a custom dataset (Garfield images) with Tensorflow Object Detection API (ssd_mobilenet_v1 model) and referring it in the android sample application available on Tensorflow repository. The application can only detected the images in distances less or equal 20cm approximately.
Do you have any clue about I can improve the model to perform recognitions in longer distances (about 30cm or more) ?
I don't know with this limitation is related with input size I'm using (tested with images with 300x300 and 68x68) or any custom data augmentation is needed to improve that.
SSD models are known to have worse performance on small objects. Have you tried using one of our FasterRCNN models to see if the result is acceptable?

small object detection with faster-RCNN in tensorflow-models

I'm attempting to train a faster-rccn model for small digit detection. I'm using the newly released tensorflow object detection API and so far have been fine tuning a pre-trained faster_rcnn_resnet101_coco from the zoo. All my training attempts have resulted in models with high precision but low recall. Out of the ~120 objects (digits) on each image only ~20 objects are ever detected, but when detected the classification is accurate. (Also, I am able to train a simple convnet from scratch on my cropped images with high accuracy so the problem is in the detection aspect of the model.) Each digit is on average 60x30 in the original images (and probably about half that size after the image is resized before being fed into the model.) Here is an example image with detected boxes of what I'm seeing:
What is odd to me is how it is able to correctly detect neighboring digits but completely miss the rest that are very similar in terms of pixel dimensions.
I have tried adjusting the hyperparameters around anchor box generation and first_stage_max_proposals but nothing has improved the results so far. Here is an example config file I have used. What other hyperparameters should I try adjusting? Any other suggestions on how to diagnose the problem? Should I be looking into other architectures or does my task look doable with faster-rccn and/or SSD?
In the end the immediate problem was that I was not using the visualizer correctly. By updating the parameters for visualize_boxes_and_labels_on_image_array as described by Johnathan in the comments I was able to see that that I am at least detecting more boxes than I had thought.
I check your config gile, you are decreasing the resolution of your image to 1024. The region of your digit will not contain a lot of pixel and you are loosing some information. What I suggest is to train the model with an another dataset (smaller images). You can for example crop the images in 4 four area.
If you have a good GPU increase the max dimension in the image_resizer, but I guess you will run out of memory

Retraining Inception and Downsampling

I have followed this Tensorflow tutorial on transfer learning with the Inception model using my own dataset of 640x360 images. My question comes in 2 parts
1) My data set conatains 640x360 images. Is the first operation that happens a downsampling to 299x299? I ask because I have a higher res version of the same dataset and I am wondering if training with the higher resolution images will result in different performance (hopefully better)
2) When running the network (using tf.sess.run()) is my input image down-sampled to 299x299?
Note: I have seen the 299x299 resolution stat listed many places online like this one and I am confused at exactly which images its referring to; the initial training dataset images (for Inception I think it was imagenet), the transfer learning dataset (my personal one), the input image when running the CNN, or a combination of the 3.
Thanks in advance :)
The inception model will resize your image to 299x299. This can be confirmed by visualizing the tensorflow graph. If you have enough samples to do the transfer learning, the accuracy will be good enough with resizing to 299x299. But if you really want to try out the training with actual resolution, the initial input layers of the graph size needs to be changed