TensorFlow: output label as a value in a 2D grid, or locating it in the grid

My final output should be a 2D grid that contains a value for each grid point. Is there a way to implement this in TensorFlow where I can input a number of images, with each image corresponding to a specific point in a 2D grid? I want my model such that when I input a similar image, it detects that specific grid cell in the 2D output. I mean that each input image belongs to a specific area in the output image (which I divided into a grid for simplicity, to make it a finite number of locations).
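One way to frame this is as a classification over grid cells. Below is a minimal sketch, assuming a Keras CNN is acceptable; the grid size, image size, and architecture are illustrative, not taken from the question:

import tensorflow as tf

GRID_H, GRID_W = 8, 8          # illustrative grid resolution (assumption)
NUM_CELLS = GRID_H * GRID_W    # each grid cell is one output class

# Simple CNN that maps an input image to a probability distribution over grid cells.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),        # assumed image size
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(NUM_CELLS, activation="softmax"),
])

# Labels are integer cell indices: cell_index = row * GRID_W + col
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

The predicted index can be converted back to a (row, col) location with divmod(cell_index, GRID_W).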

Related

Why isn't there a 3D array image in Vulkan?

In the Vulkan API it was seen as valuable to include a VK_IMAGE_VIEW_TYPE_CUBE_ARRAY, but not a 3D array:
typedef enum VkImageViewType {
    VK_IMAGE_VIEW_TYPE_1D = 0,
    VK_IMAGE_VIEW_TYPE_2D = 1,
    VK_IMAGE_VIEW_TYPE_3D = 2,
    VK_IMAGE_VIEW_TYPE_CUBE = 3,
    VK_IMAGE_VIEW_TYPE_1D_ARRAY = 4,
    VK_IMAGE_VIEW_TYPE_2D_ARRAY = 5,
    VK_IMAGE_VIEW_TYPE_CUBE_ARRAY = 6,
} VkImageViewType;
Every 6 layers of the view for a cube array is another cube. I'm actually struggling to think of a use case for a cube array, and I don't really think it would be useful for a 3D array either, but why does the cube get an array type and not the 3D image? How is this cube array even supposed to be used? Is there even a cube array sampler?
Cube maps, cube map arrays, and 2D array textures are, in terms of the bits and bytes of storage, ultimately the same thing. All of these views are created from the same kind of image. You have to specify if you need a layered 2D image to be usable as an array or a cubemap (or both), but conceptually, they're all just the same thing.
Each mipmap level consists of L images of a size WxH, where W and H shrink based on the original size and the current mipmap level. L is the number of layers specified at image creation time, and it does not change with the mipmap level. Put simply, there are a constant number of 2D images per mipmap level. Cubemaps and cubemap arrays require L to be either 6 or a multiple of 6 respectively, but it's still constant.
A 3D image is not that. Each mipmap level consists of a single image of size WxHxD, where W, H, and D shrink based on the original size and current mipmap level. Even if you think of a mipmap level of a 3D image as being D number of WxH images, the number D is not constant between mipmap levels.
These are not the same things.
To have a 3D array image, you would need to have each mipmap level contain L 3D images of size WxHxD, where L is the same for each mipmap level.
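To make the difference concrete, here is a small sketch (plain Python, with arbitrarily chosen dimensions) that prints the per-mip storage of a 2D array image versus a 3D image; the layer count stays fixed for the array, while the depth of the 3D image shrinks along with width and height:

# Mip chain of a 2D array image: L layers of WxH at every level.
def mips_2d_array(w, h, layers):
    level = 0
    while w > 1 or h > 1:
        print(f"level {level}: {layers} layers of {w}x{h}")
        w, h = max(w // 2, 1), max(h // 2, 1)
        level += 1
    print(f"level {level}: {layers} layers of {w}x{h}")

# Mip chain of a 3D image: one WxHxD volume whose depth also shrinks.
def mips_3d(w, h, d):
    level = 0
    while w > 1 or h > 1 or d > 1:
        print(f"level {level}: 1 volume of {w}x{h}x{d}")
        w, h, d = max(w // 2, 1), max(h // 2, 1), max(d // 2, 1)
        level += 1
    print(f"level {level}: 1 volume of {w}x{h}x{d}")

mips_2d_array(8, 8, 4)   # 4 layers at every level
mips_3d(8, 8, 4)         # depth goes 4 -> 2 -> 1 across levels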
As for the utility of a cubemap array, it's the same utility you would get out of a 2D array compared to a single 2D image. You use array textures when you need to specify one of a number of images to sample. It's just that in one case, each image is a single 2D image, while in another case, each image is a cubemap.
For a more specific example, many advanced forms of shadow mapping require the use of multiple shadow maps, selected at runtime. That's a good use case for an array texture. You can apply these techniques to point lights through the use of cube maps, but now you need to have the individual images in the array be cube maps, not just single 2D images.

Get the location of an object to crop by providing a pixel label in TensorFlow

I have a dataset of images (every image is in RGB format) and corresponding label images (which contain the label of every pixel in the image).
I need to extract the objects (pixels) of a particular class from the original images.
First I have to find the location of the object using the label image (by providing the label of the given object). It is doable using explicit for loops, but I don't want to use explicit for loops.
Now my questions:
Is there any built-in function in TensorFlow that gives me the location (rectangles are fine) of a given object if I provide the label of that object?
After that I can use tf.image.crop_and_resize to crop the image, but I am not able to find any function that will give me the location of objects.
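No answer is recorded here, but as a minimal sketch (assuming the label image is a 2D integer tensor, the class is present in it, and a single axis-aligned rectangle around all pixels of that class is enough), tf.where plus tf.reduce_min/tf.reduce_max can produce a bounding box without explicit Python loops:

import tensorflow as tf

def bounding_box_for_label(label_image, class_id):
    """Return (ymin, xmin, ymax, xmax) enclosing all pixels equal to class_id."""
    # Indices of every pixel that carries the requested label, shape (N, 2).
    coords = tf.where(tf.equal(label_image, class_id))
    top_left = tf.reduce_min(coords, axis=0)       # smallest (row, col)
    bottom_right = tf.reduce_max(coords, axis=0)   # largest (row, col)
    return top_left[0], top_left[1], bottom_right[0], bottom_right[1]

# Example: a 4x4 label map with class 2 occupying a small block.
labels = tf.constant([[0, 0, 0, 0],
                      [0, 2, 2, 0],
                      [0, 2, 2, 0],
                      [0, 0, 0, 0]])
print(bounding_box_for_label(labels, 2))  # (1, 1, 2, 2)

The resulting box can be normalized to [0, 1] and passed to tf.image.crop_and_resize. Note that this gives one rectangle covering every instance of the class; separating individual instances would need connected-component labeling, which is not shown here.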

Use of base anchor size in Single Shot Multi-box detector

I was digging in the Tensorflow Object Detection API in order to check out the anchor box generations for SSD architecture. In this py file where the anchor boxes are generated on the fly, I am unable to understand the usage of base_anchor_size. In the corresponding paper, there is no mention of such thing. Two questions in short:
What is the use of base_anchor_size parameter? Is it important?
How does this parameter affect the training in the cases where the original input image is square in shape and the case when it isn't square?
In the SSD architecture there are anchor scales that are fixed ahead of time, e.g. linear values across the range 0.2-0.9. These values are relative to the image size. For example, given a 320x320 image, the smallest anchor (with 1:1 ratio) will be 64x64, and the largest anchor will be 288x288. However, suppose you wish to feed your model a larger image, e.g. 640x640, without changing the absolute anchor sizes (for example because these are images of far-away objects, so there is no need for large anchors; keeping the anchor sizes unchanged also means you don't have to fine-tune the model for the new resolution). Then you can simply set base_anchor_size=0.5, meaning the anchor scales become 0.5*[0.2-0.9] relative to the input image size.
The default value for this parameter is [1.0, 1.0], meaning it has no effect.
The entries correspond to [height, width] relative to the maximal square you can fit in the image, meaning [min(image_height,image_width),min(image_height,image_width)]. So, if for example, your input image is VGA, i.e. 640x480, then the base_anchor_size is taken to be relative to [480,480].
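A small sketch of the arithmetic described above (plain Python; the scale range and image sizes are the ones from the example, and the helper name is made up):

def absolute_anchor_sizes(image_height, image_width,
                          scales=(0.2, 0.9), base_anchor_size=(1.0, 1.0)):
    """Smallest/largest 1:1 anchor side lengths, in pixels."""
    # Anchors are taken relative to the largest square that fits in the image.
    base = min(image_height, image_width)
    return [(base_anchor_size[0] * s * base,   # height
             base_anchor_size[1] * s * base)   # width
            for s in scales]

print(absolute_anchor_sizes(320, 320))                               # [(64, 64), (288, 288)]
print(absolute_anchor_sizes(640, 640, base_anchor_size=(0.5, 0.5)))  # same absolute sizes
print(absolute_anchor_sizes(480, 640))                               # relative to 480x480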

How to refine the GraphCut cmex code based on specific energy functions?

I downloaded the following graph-cut code:
https://github.com/shaibagon/GCMex
I compiled the mex files and ran it on the pre-defined image in the code (which is an RGB image).
I want to optimize the image segmentation results.
I have a probability map of the image, whose dimensions are (width, height, 5): five probability distributions over the image are stacked together, each relating to one of the classes.
My problem is which parts of the code should change according to the probability image.
I want to define the Data and Smoothness terms based on my application.
My questions are:
1) Has someone refined the code to define a different energy function (I want to change the unary and pairwise formulations)?
2) I have a stack of 3D images. I want to define a 6-neighborhood system: 4 neighbors in the current slice and the other two from the two adjacent slices. In which function and part of the code can I make these refinements?
Thanks
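No answer is recorded here, but as a minimal, library-agnostic sketch (NumPy rather than the GCMex MATLAB interface, which is not reproduced; the -log(p) unary and the axis-wise 6-connected offsets are common choices, not something taken from that repository), the two ingredients the question asks about could be assembled like this:

import numpy as np

def unary_costs(prob_map, eps=1e-8):
    """Data term: per-pixel/voxel cost of each class, taken as -log(probability)."""
    return -np.log(prob_map + eps)          # same shape as prob_map, last axis = classes

def six_neighbor_pairs(shape):
    """Pairwise term support: index pairs for a 6-neighborhood in a 3D stack
    (4 in-slice neighbors plus one neighbor in each of the two adjacent slices)."""
    idx = np.arange(np.prod(shape)).reshape(shape)
    pairs = []
    for axis in range(3):                   # x, y (in-slice) and z (across slices)
        a = np.moveaxis(idx, axis, 0)
        pairs.append(np.stack([a[:-1].ravel(), a[1:].ravel()], axis=1))
    return np.concatenate(pairs, axis=0)    # each row: (voxel_i, voxel_j)

In GCMex itself these would correspond to the unary (data cost) matrix and the sparse pairwise (neighborhood) structure passed to the mex function; the exact argument layout should be checked against the repository's README.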

YOLO object detection model?

Currently, I am reading the YOLO9000 paper "https://arxiv.org/pdf/1612.08242.pdf" and I am very confused about how the model can predict the bounding box for object detection. I did many examples with TensorFlow, and in most of them we give the model "images and labels of images".
My questions are:
1- How can we pass the bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
Probability of an object being present in that grid cell
If an object lies in this grid cell, what would be the coordinates of the bounding box?
If an object lies in this grid cell, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box coordinates and the correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Next you simply train the deep neural network to regress these parameters.
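A minimal sketch of that target encoding (NumPy; a single box per cell, no anchor boxes, and the 7x7 grid and class count are illustrative simplifications of the paper's scheme):

import numpy as np

S, C = 7, 20   # grid size and number of classes (illustrative)

def encode_targets(boxes, labels, img_w, img_h):
    """boxes: list of (xmin, ymin, xmax, ymax) in pixels; labels: class indices.
    Returns an S x S x (5 + C) target tensor: [objectness, x, y, w, h, one-hot class]."""
    target = np.zeros((S, S, 5 + C), dtype=np.float32)
    for (xmin, ymin, xmax, ymax), cls in zip(boxes, labels):
        # Box center and size, relative to the image.
        cx, cy = (xmin + xmax) / 2 / img_w, (ymin + ymax) / 2 / img_h
        w, h = (xmax - xmin) / img_w, (ymax - ymin) / img_h
        # The cell containing the box center is responsible for this box.
        col, row = min(int(cx * S), S - 1), min(int(cy * S), S - 1)
        target[row, col, 0] = 1.0                      # an object is present here
        target[row, col, 1:5] = [cx, cy, w, h]         # box parameters to regress
        target[row, col, 5 + cls] = 1.0                # class one-hot
    return target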
To pass the bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using specific annotation tools. Here, you have to draw a boundary that encloses an object and then label that bounding box/rectangle. You have to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project in this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper info on providing bounding boxes to the object detection model, I'll update it here. Also, if you have solved this problem by now, you could provide the details to help out others facing the same kind of problem.
1- How can we pass the bounding boxes instead of labels to the model?
If we want to train a model that performs object detection (not just object classification), we have to pass the ground-truth labels as, for example, .xml files. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values:
class name of this object, such as car or human...
xmin: x coordinate of the box's top left point
ymin: y coordinate of the box's top left point
xmax: x coordinate of the box's bottom right point
ymax: y coordinate of the box's bottom right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the xml file will contain 3 sets of these values.
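For example, annotations in the Pascal VOC style store exactly these five values and can be read without any detection framework; a minimal sketch using only the standard library (the file name at the bottom is made up):

import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) from a VOC-style xml file."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        objects.append((name,
                        int(float(box.find("xmin").text)),
                        int(float(box.find("ymin").text)),
                        int(float(box.find("xmax").text)),
                        int(float(box.find("ymax").text))))
    return objects

# print(read_voc_annotation("image_0001.xml"))  # hypothetical annotation file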
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of objects you're going to detect. You can see that there are 13x13 = 169 cells (grid cells), and each cell has D values (depth).
Among the 169 grid cells, some grid cells are responsible for predicting bounding boxes. If the center of a true bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when it is given the same image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for detecting which bounding boxes.
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions. So the loss is calculated as the weighted sum of the following individual loss values:
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
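A simplified sketch of such a weighted sum (the weights and the mean-squared-error form are illustrative; the actual YOLOv2 loss has more terms, e.g. a no-object confidence penalty, which are omitted here):

import numpy as np

def yolo_like_loss(pred, target, w_class=1.0, w_xy=5.0, w_wh=5.0):
    """pred/target: arrays of shape (S, S, 5 + C) laid out as
    [objectness, x, y, w, h, class scores]; costs only apply where an object exists."""
    obj_mask = target[..., 0:1]                        # 1 where a box's center falls
    class_loss = np.sum(obj_mask * (pred[..., 5:] - target[..., 5:]) ** 2)
    xy_loss    = np.sum(obj_mask * (pred[..., 1:3] - target[..., 1:3]) ** 2)
    wh_loss    = np.sum(obj_mask * (pred[..., 3:5] - target[..., 3:5]) ** 2)
    return w_class * class_loss + w_xy * xy_loss + w_wh * wh_loss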
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
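That relative conversion is simple arithmetic; a small sketch (box given as pixel corners, output as center/size in [0, 1]):

def to_relative(xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert pixel corner coordinates into (x_center, y_center, w, h) relative to the image."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    w = (xmax - xmin) / img_w
    h = (ymax - ymin) / img_h
    return x_center, y_center, w, h

print(to_relative(100, 50, 300, 250, img_w=640, img_h=480))  # (0.3125, 0.3125, 0.3125, ~0.4167)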