Is it possible to get the results or coordinates of the mask detection or the bounding box surrounding the image? I am using Mask R-CNN from matterport and the visualization of the masks on the image works quite well, but I would like to save the coordinates.
I am not sure how you are using this model, but if you import their model and use the detect method (which is the straightforward way to use it), the coordinates are returned directly.
See this documentation for an explanation of what model.detect returns.
In short, per image you get a dict, and your coordinates will be in the_dict["rois"]; each row is one detected instance as [y1, x1, y2, x2].
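For example, a minimal sketch of pulling the coordinates out of the detection results (assuming the matterport/Mask_RCNN API, a model already built in inference mode with weights loaded, and an RGB NumPy array called image):

import numpy as np

results = model.detect([image], verbose=0)
r = results[0]                 # one dict per input image

boxes = r["rois"]              # (N, 4) array, one [y1, x1, y2, x2] row per instance
masks = r["masks"]             # (H, W, N) boolean masks
class_ids = r["class_ids"]     # (N,) class indices
scores = r["scores"]           # (N,) confidence scores

# Save the box coordinates, e.g. as CSV
np.savetxt("boxes.csv", boxes, fmt="%d", delimiter=",")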
I am trying to use MATLAB's camera calibrator to calibrate an infrared camera. I was able to get the intrinsic matrix by just feeding around 100 images to the calibrator. But I'm struggling with how to get the extrinsic matrix [R|t].
Since the extrinsic matrix maps the world frame to the camera frame, in theory there will be many extrinsic matrices when the camera (or object) is moving.
In the picture below, if the intrinsic matrix is determined using 50 images, then there are 50 extrinsic matrices, one corresponding to each image. Am I correct?
You are right. Usually, a by-product of an intrinsic calibration is the extrinsic matrix for each pattern observed; this is mostly used to draw the patterns with respect to the camera as in the picture you posted.
What you usually do afterwards is define some external reference frame that makes sense for your application, also known as the 'world' reference frame, and compute the pose of the camera with respect to it. That's the extrinsic matrix you always hear about.
For this, you:
1. Define the reference frame and take some points with known 3D coordinates on it; this can be a grid drawn on the floor, for example.
2. Take a picture of the 3D points with the calibrated camera and get the corresponding 2D (image) coordinates of the points.
3. Use a pose estimation function that takes the camera intrinsic parameters, the 3D points and the corresponding 2D image points (a minimal OpenCV sketch follows below). I am more familiar with OpenCV, but the MATLAB function that seems to do the job is: https://www.mathworks.com/help/vision/ref/estimateworldcamerapose.html
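Here is that OpenCV sketch for step 3; the 3D points, pixel coordinates and intrinsics below are placeholder values, and cv2.solvePnP returns the rotation/translation that map the world frame to the camera frame:

import numpy as np
import cv2

# 3D points with known coordinates in your world frame (e.g. a grid on the floor), in meters
object_points = np.array([[0, 0, 0],
                          [1, 0, 0],
                          [1, 1, 0],
                          [0, 1, 0]], dtype=np.float64)

# Corresponding 2D pixel coordinates measured in the image
image_points = np.array([[410,  655],
                         [880,  660],
                         [890, 1020],
                         [400, 1015]], dtype=np.float64)

# Intrinsics and distortion from the calibration step
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])
dist = np.zeros(5)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, dist)
R, _ = cv2.Rodrigues(rvec)           # 3x3 rotation matrix
extrinsic = np.hstack([R, tvec])     # the [R | t] matrix mapping world -> camera
print(extrinsic)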
My problem is not exactly annotating data using polygons, circles or lines; it's how to use that annotated data to generate a ".tfrecord" file and perform object detection. The tutorials I saw use rectangle annotations, like these: taylor swift detection, raccoon detection.
Rectangle annotation would work fine for me if the objects I want to detect (pipelines) were not so close to each other.
Example of a rectangle drawn in PASCAL VOC format:
<bndbox>
  <xmin>82</xmin>
  <xmax>172</xmax>
  <ymin>108</ymin>
  <ymax>146</ymax>
</bndbox>
Is there a way to add a "mask" to highlight some part of this bounding box?
If anything is unclear, please let me know.
If your objects are very close to each other, you can go for instance segmentation instead of object detection; there you can use polygons to generate masks and bounding boxes to train the model.
Consider this well-presented and easy-to-use repository for Mask R-CNN (a kind of instance segmentation):
https://github.com/matterport/Mask_RCNN
Check this for a lightweight Mask R-CNN.
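To illustrate the polygon-to-mask idea, here is a minimal sketch (not taken from the linked repositories; the image size and polygon vertices are made-up examples) that rasterizes a polygon annotation into a binary mask and derives a bounding box from it:

import numpy as np
from PIL import Image, ImageDraw

height, width = 480, 640
polygon = [(120, 200), (300, 180), (340, 320), (150, 350)]   # (x, y) vertices

mask_img = Image.new("L", (width, height), 0)
ImageDraw.Draw(mask_img).polygon(polygon, outline=1, fill=1)
mask = np.array(mask_img, dtype=bool)      # (H, W) binary mask

ys, xs = np.nonzero(mask)
xmin, xmax = xs.min(), xs.max()            # bounding box derived from the mask
ymin, ymax = ys.min(), ys.max()
print(xmin, ymin, xmax, ymax)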
I am using TensorFlow and TFLite to detect objects. The model I use is mobilenet_ssd (version 2) from https://github.com/tensorflow/models/tree/master/research/object_detection
The input image size for detection is fixed at 300x300, which is hard-coded in the model.
I want to feed 1280x720 images for detection; how can I do this? I do not have a training image dataset at 1280x720 resolution, only the PASCAL and COCO datasets.
How can I modify the model to accept 1280x720 images (without scaling them) for detection?
To change the input size of the image, you need to redesign the anchor boxes, because the anchors are fixed to the input image resolution. Once you regenerate the anchor positions for 720p, the MobileNet SSD can accept 720p input.
The common practice is to scale the input image before feeding the data into TensorFlow / TensorFlow Lite.
Note: the images in the training data set aren't 300x300 originally. The original resolution may be bigger and non-square, and it is downscaled to 300x300. That means it's totally fine to downscale a 1280x720 image to 300x300, and detection should still work.
Do you mind trying the scaling approach to see if it works?
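A rough sketch of that scaling step with the TFLite Python API (the model path is a placeholder, and the exact output tensor ordering depends on how the model was exported):

import numpy as np
import cv2
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

frame = cv2.imread("frame_1280x720.jpg")          # original 1280x720 image
resized = cv2.resize(frame, (300, 300))           # downscale to the model's fixed input size
input_data = np.expand_dims(resized, axis=0).astype(input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

# For the standard SSD export the first output holds normalized [ymin, xmin, ymax, xmax] boxes;
# scale them back to 1280x720 pixel coordinates.
boxes = interpreter.get_tensor(output_details[0]["index"])[0]
h, w = frame.shape[:2]
pixel_boxes = boxes * np.array([h, w, h, w])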
I'm trying to collect my own training data set for image detection (recognition, for now). Right now I have 4 classes and 750 images for each. The images are just regular images of each class; however, some of them are blurry or contain outside objects, such as different backgrounds or other factors (but nothing distinctive). Using that training data set, image recognition is really bad.
My questions are:
1. Does the training image set need to contain the object in various backgrounds/settings/environments (I believe not...)?
2. Let's just say training worked fairly accurately and I want to know the location of the object in the image. I figure there is no way to find the location just using image recognition, so if I use bounding boxes, how/where in the code can I see the location of the bounding box?
Thank you in advance!
It is difficult to know in advance which features your program will learn for each class. But then again, if your unseen images will have the same background, the background will play no role. I would suggest data augmentation during training: random color distortion, random flipping, random cropping.
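A minimal sketch of those augmentations with tf.image (assuming TensorFlow 2; the ranges below are arbitrary choices, not tuned values):

import tensorflow as tf

def augment(image):
    # image: float32 tensor of shape (H, W, 3)
    size = tf.shape(image)[:2]
    image = tf.image.random_flip_left_right(image)        # random flipping
    image = tf.image.random_brightness(image, 0.2)        # random color distortion
    image = tf.image.random_saturation(image, 0.8, 1.2)
    # random cropping to ~90% of the original size, then resize back
    crop_size = tf.cast(tf.cast(size, tf.float32) * 0.9, tf.int32)
    image = tf.image.random_crop(image, tf.concat([crop_size, [3]], axis=0))
    image = tf.image.resize(image, size)
    return image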
You can't just see in the code where the bounding box is. You have to label/annotate the boxes yourself first in your collected data, using a tool such as LabelMe, for example. Then comes training the object detector.
TensorFlow's Object Detection API has an option in the .config file to add a keep_aspect_ratio_resizer. If I resize my training data using this, will the corresponding bounding boxes be resized as well? If they don't match up, the network will be seeing incorrect examples.
Yes, the boxes will be resized to stay consistent with the images as well! The API keeps box coordinates normalized relative to the image, so they follow whatever resizing the preprocessing applies.
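For reference, this is roughly what that option looks like inside the model section of the pipeline .config file (the dimension values are placeholders, not recommendations):

image_resizer {
  keep_aspect_ratio_resizer {
    min_dimension: 600
    max_dimension: 1024
  }
}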