Get the location of an object to crop by providing a pixel label in TensorFlow

I have a dataset of images (every image is in RGB format) and corresponding label images (which contain the class label of every pixel in the image).
I need to extract the objects (pixels) of a particular class from the original images.
First I have to find the location of the object using the label image (by providing the label of the given object). This is doable with explicit for loops, but I don't want to use explicit for loops.
Now my questions:
Is there any built-in function in TensorFlow that gives me the location (rectangles are fine) of a given object if I provide the label of that object?
After that I can use tf.image.crop_and_resize to crop the image, but I am not able to find any function that will give me the locations of the objects.
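A minimal sketch of one way to do this without explicit Python loops, assuming a 2-D integer label image and at least one pixel carrying the requested label; the names image and label_image and the 128x128 crop size are placeholders, not anything from the question:

import tensorflow as tf

def label_to_box(label_image, target_label):
    """Normalized [y1, x1, y2, x2] box around every pixel whose value
    in the 2-D tensor `label_image` equals `target_label`."""
    coords = tf.cast(tf.where(tf.equal(label_image, target_label)), tf.float32)
    shape = tf.cast(tf.shape(label_image), tf.float32)
    y_min = tf.reduce_min(coords[:, 0]) / shape[0]
    x_min = tf.reduce_min(coords[:, 1]) / shape[1]
    y_max = tf.reduce_max(coords[:, 0]) / shape[0]
    x_max = tf.reduce_max(coords[:, 1]) / shape[1]
    return tf.stack([y_min, x_min, y_max, x_max])

# image: [H, W, 3] float tensor, label_image: [H, W] int tensor (assumed names)
box = label_to_box(label_image, target_label=5)
crop = tf.image.crop_and_resize(
    tf.expand_dims(image, 0),   # batch of one image
    tf.expand_dims(box, 0),     # one box per crop
    [0],                        # which image in the batch each box belongs to
    crop_size=[128, 128])

Note that this gives one rectangle per label, i.e. the union of all pixels with that label; separating multiple instances of the same class would need something extra such as a connected-components step.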

Related

Tensorflow object detection API - Setting specific color to bounding boxes

I am trying to detect 3 different classes of objects in images using Tensorflow Object Detection. I would like to set the bounding box color for each class to a custom color of my choice in order to suit my application.
For example,
Class 1: Red
Class 2: Blue
Class 3: Green
Unfortunately, TensorFlow Object Detection sets these colors automatically and I do not know how to change them.
I would be very grateful for any suggestions and help.
You can achieve this by passing a track_ids array to the function visualize_boxes_and_labels_on_image_array.
Notice that when detection is performed, this plot function is called to visualize the bounding boxes on the images.
Here is how to get the track_ids variable. First look at the STANDARD_COLORS list and get the index of the color you want to draw boxes with; for example, the index of 'Red' is 98. Then loop through the variable output_dict['detection_classes'] (which is also passed to the plot function) and, whenever you encounter class 1, append 98 to track_ids. This way you build a list of color indices, one per detection; transform it into a numpy array and pass it to the plot function, and each class will be plotted in the color you assigned.
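A sketch of what that could look like, assuming a version of object_detection/utils/visualization_utils.py whose visualize_boxes_and_labels_on_image_array accepts a track_ids argument, and that output_dict, category_index and image_np come from the usual detection tutorial code. The color index 98 ('Red') is taken from the answer above; the other two indices are placeholders, so look up the ones you want in STANDARD_COLORS:

import numpy as np
from object_detection.utils import visualization_utils as vis_util

# Class id -> index of the desired color in vis_util.STANDARD_COLORS.
class_to_color_index = {1: 98,   # 'Red' per the answer above
                        2: 20,   # placeholder index
                        3: 40}   # placeholder index

track_ids = np.array([class_to_color_index.get(int(c), 0)
                      for c in output_dict['detection_classes']])

vis_util.visualize_boxes_and_labels_on_image_array(
    image_np,
    output_dict['detection_boxes'],
    output_dict['detection_classes'],
    output_dict['detection_scores'],
    category_index,
    track_ids=track_ids,
    use_normalized_coordinates=True,
    line_thickness=4)

Depending on the library version, the function may remap track_ids internally before indexing STANDARD_COLORS, so verify the resulting colors on a test image.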

Use of base anchor size in Single Shot Multi-box detector

I was digging into the TensorFlow Object Detection API in order to check out the anchor box generation for the SSD architecture. In this py file, where the anchor boxes are generated on the fly, I am unable to understand the usage of base_anchor_size. In the corresponding paper there is no mention of such a thing. Two questions in short:
What is the use of the base_anchor_size parameter? Is it important?
How does this parameter affect training when the original input image is square in shape and when it isn't?
In the SSD architecture there are anchor scales that are fixed ahead of time, e.g. linear values across the range 0.2-0.9. These values are relative to the image size. For example, given a 320x320 image, the smallest anchor (with 1:1 ratio) will be 64x64 and the largest anchor will be 288x288. However, if you wish to feed your model a larger image, e.g. 640x640, but without changing the anchor sizes (for example because these are images of far-away objects, so there is no need for large anchors; leaving the anchor sizes untouched means you don't have to fine-tune the model on the new resolution), then you can simply set base_anchor_size=0.5, meaning the anchor scales become 0.5*[0.2-0.9] relative to the input image size.
The default value for this parameter is [1.0, 1.0], meaning it has no effect.
The entries correspond to [height, width] relative to the maximal square you can fit in the image, meaning [min(image_height,image_width),min(image_height,image_width)]. So, if for example, your input image is VGA, i.e. 640x480, then the base_anchor_size is taken to be relative to [480,480].
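Toy arithmetic for the example above (square image, 1:1 aspect ratio), just to illustrate how base_anchor_size rescales the anchors; this is not the actual anchor-generator code, and the five linearly spaced scales are an assumption:

image_size = 640            # larger input image from the example
base_anchor_size = 0.5      # instead of the default 1.0
scales = [0.2, 0.375, 0.55, 0.725, 0.9]   # linear over 0.2-0.9

for s in scales:
    side = base_anchor_size * s * image_size
    print("scale %.3f -> anchor about %dx%d px" % (s, side, side))
# With base_anchor_size = 0.5 the smallest 1:1 anchor stays at 64x64 px,
# the same as for a 320x320 input with the default value of 1.0.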

YOLO object detection model?

Currently I am reading the YOLO9000 paper "https://arxiv.org/pdf/1612.08242.pdf" and I am very confused about how the model can predict the bounding box for object detection. I did many examples with TensorFlow, and in most of them we give the model images and the labels of those images.
My questions are:
1- How can we pass bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
Probability of an object being present in that grid location
If an object lies in this grid location, what are the coordinates of its bounding box?
If an object lies in this grid location, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box coordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Then you simply train the deep neural network to regress these parameters.
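A minimal sketch of that data preparation for a single image, assuming one box per grid cell (as in the description above), C classes and pixel-coordinate boxes; the helper name encode_targets and the layout [objectness, cx, cy, w, h, one-hot class] are my own choices, not from any particular YOLO implementation:

import numpy as np

S = 7            # grid size
C = 20           # number of classes (assumption)

def encode_targets(boxes, labels, image_w, image_h):
    """Build an (S, S, 5 + C) target tensor for one image.
    boxes: list of (x_min, y_min, x_max, y_max) in pixels,
    labels: list of class indices, one per box."""
    target = np.zeros((S, S, 5 + C), dtype=np.float32)
    for (x_min, y_min, x_max, y_max), cls in zip(boxes, labels):
        # box centre, width and height relative to the image
        cx = (x_min + x_max) / 2.0 / image_w
        cy = (y_min + y_max) / 2.0 / image_h
        w = (x_max - x_min) / float(image_w)
        h = (y_max - y_min) / float(image_h)
        # the grid cell containing the centre is responsible for this box
        col = min(int(cx * S), S - 1)
        row = min(int(cy * S), S - 1)
        target[row, col, 0] = 1.0                  # object present
        target[row, col, 1:5] = [cx, cy, w, h]     # box parameters
        target[row, col, 5 + cls] = 1.0            # one-hot class
    return target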
To pass bounding boxes for an image we need to create them first. You can create bounding boxes for any image using specific annotation tools. Here, you have to draw a boundary that encloses an object and then label that bounding box/rectangle. You have to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project in this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper info on providing bounding boxes to the object detection model I'll also update it here. Also, if you have solved this problem by now, you could provide the details to help out others facing the same kind of problem.
1- How can we pass bounding boxes instead of labels to the model?
If we want to train a model that performs object detection (not object classification), we have to pass the ground-truth labels as, for example, .xml files. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values:
class name of the object, such as car or human...
xmin: x coordinate of the box's top-left point
ymin: y coordinate of the box's top-left point
xmax: x coordinate of the box's bottom-right point
ymax: y coordinate of the box's bottom-right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the xml file will contain 3 such sets.
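Assuming the .xml files follow the common Pascal VOC layout (which is what the 5 values above correspond to), a small reader could look like this; the function name is just for illustration:

import xml.etree.ElementTree as ET

def read_annotation(xml_path):
    """Return one (class_name, xmin, ymin, xmax, ymax) tuple per object
    in a Pascal VOC style annotation file."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        box = obj.find('bndbox')
        objects.append((name,
                        int(float(box.find('xmin').text)),
                        int(float(box.find('ymin').text)),
                        int(float(box.find('xmax').text)),
                        int(float(box.find('ymax').text))))
    return objects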
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 (YOLO9000) has shape (13, 13, D), where D depends on how many classes of objects you want to detect. You can see that there are 13x13 = 169 grid cells, and each cell has D values (its depth).
Among the 169 grid cells, some are responsible for predicting bounding boxes. If the center of a ground-truth bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when it is given the same image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for detecting the bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions. So the loss is calculated as the weighted sum of the following individual loss values (a small sketch follows the list):
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
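A minimal sketch of such a weighted sum. The lambda weights are the ones from the original YOLO paper (5.0 for coordinates, 0.5 for the no-object confidence term); note that YOLO's full loss also contains confidence/objectness terms that the list above does not mention, and darkflow/YOLOv2 may weight things differently:

def yolo_style_total_loss(class_loss, xy_loss, wh_loss,
                          obj_conf_loss, noobj_conf_loss,
                          lambda_coord=5.0, lambda_noobj=0.5):
    """Weighted sum of the individual loss terms described above."""
    return (lambda_coord * (xy_loss + wh_loss)   # box position and shape
            + obj_conf_loss                      # confidence where objects exist
            + lambda_noobj * noobj_conf_loss     # confidence where they don't
            + class_loss)                        # classification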
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
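Plugging numbers into the side note, for example with three custom classes (the class count is just an example):

S = 13          # YOLOv2 grid size
B = 5           # boxes per grid cell (darkflow default, per the side note)
C = 3           # number of classes, e.g. three custom classes

D = B * (5 + C)             # depth of the (13, 13, D) output -> 40
total_boxes = S * S * B     # candidate boxes per image        -> 845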

Input parameters of CITemperatureAndTint (CIFilter)

I fail to understand the input parameters of the CIFilter named CITemperatureAndTint. The documentation says it has two input parameters, which are both 2D CIVectors.
I played with this filter a lot - via actual code, via Core Image Fun House (example project from Apple - "FunHouse") and via iPhoto.
My intuition says that this filter should have two scalar input parameters: One for the temperature and one for the tint. If you look at the UI of iPhoto you see this:
Screenshot of iPhoto's Temperature and Tint UI:
As expected: one slider for the temperature and one for the tint. How did Apple "bind" the value of each slider to a 2D vector? akaru asked this question already but got no answer: What's up with CITemperatureAndTint having vector inputs?
I have opened a technical support incident at Apple and asked them the same question. Here is the answer from the Apple engineer:
CITemperatureAndTint has three input parameters: Image, Neutral and TargetNeutral. Neutral and TargetNeutral are of 2D CIVector type, and in both of them, note that the first dimension refers to Temperature and the second dimension refers to Tint. What the CITemperatureAndTint filter basically does is computing a matrix that adapts RGB values from the source white point defined by Neutral (srcTemperature, srcTint) to the target white point defined by TargetNeutral (dstTemperature, dstTint), and then applying this matrix on the input image (using the CIColorMatrix filter). If Neutral and TargetNeutral are of the same values, then the image will not change after applying this filter. I don't know the implementation details about iPhoto, but I think the two slide bars give the Temperature and Tint changes (i.e. differences between source and target Temperature and Tint values already) that you want to add to the source image.
Now I have to get my head around this answer but it seems to be a very good response from Apple.
They should be 2D vectors containing the color temperature. The default of (6500, 0) will leave the color unchanged, as described here. You can see what values for color temperature give you which colors in this wikipedia link. I'm not sure what the 2nd element of the vector is for.

How to count objects detected in an image using Tensorflow?

I have trained my custom object detector using faster_rcnn_inception_v2 and tested it using object_detection_tutorial.ipynb, and it works perfectly: I can find the bounding boxes for the objects inside a test image. My problem is how to actually count the number of those bounding boxes, or simply, how to count the number of objects detected for each class.
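Since the detections already come out of object_detection_tutorial.ipynb, here is a minimal sketch of per-class counting, assuming the notebook's usual output_dict keys and an arbitrary 0.5 score threshold:

import collections

min_score = 0.5                                  # arbitrary threshold
num = int(output_dict['num_detections'])
classes = output_dict['detection_classes'][:num]
scores = output_dict['detection_scores'][:num]

counts = collections.Counter(
    category_index[int(c)]['name']
    for c, s in zip(classes, scores) if s >= min_score)

print(counts)   # e.g. Counter({'my_class_a': 3, 'my_class_b': 1})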
Because of low reputation I cannot comment.
As far as I know the Object Detection API unfortunately has no built-in function for this.
You have to write this function yourself. I assume you run eval.py for evaluation!? To access the individual detected objects for each image you have to follow this chain of scripts:
eval.py -> evaluator.py ->object_detection_evaluation.py -> per_image_evaluation.py
In the last script you can count the detected objects and bounding boxes per image. You just have to save the numbers and sum them up over your entire dataset.
Does this already help you?
I solved this using the Tensorflow Object Counting API. There is an example of counting objects in an image in single_image_object_counting.py. I just replaced ssd_mobilenet_v1_coco_2017_11_17 with my own model containing the inference graph:
input_video = "image.jpg"
detection_graph, category_index = backbone.set_model(MODEL_DIR)