When rendering to multiple layers, is the correct image view type 2D or 2D_ARRAY? (Vulkan)

When rendering to multiple layers using multiview, the layers field of the framebuffer creation struct must be set to 1 (which may seem counterintuitive), and each image being rendered to must have at least as many array layers as the number of layers you're rendering.
So if I'm rendering to, for example, 4 layers, I create an image view for the image in question with baseArrayLayer = 0 and an array layer count of 4. My question is whether the VkImageViewType of the image views I use to create the framebuffer needs to be VK_IMAGE_VIEW_TYPE_2D or VK_IMAGE_VIEW_TYPE_2D_ARRAY.

Related

Why isn't there a 3D array image in Vulkan?

In the Vulkan API it was seen as valuable to include a VK_IMAGE_VIEW_TYPE_CUBE_ARRAY, but not a 3D array:
typedef enum VkImageViewType {
    VK_IMAGE_VIEW_TYPE_1D = 0,
    VK_IMAGE_VIEW_TYPE_2D = 1,
    VK_IMAGE_VIEW_TYPE_3D = 2,
    VK_IMAGE_VIEW_TYPE_CUBE = 3,
    VK_IMAGE_VIEW_TYPE_1D_ARRAY = 4,
    VK_IMAGE_VIEW_TYPE_2D_ARRAY = 5,
    VK_IMAGE_VIEW_TYPE_CUBE_ARRAY = 6,
} VkImageViewType;
Each set of 6 layers in a cube array view is another cube. I'm actually struggling to think of a use case for a cube array, and I don't really think a 3D array would be useful either, but why does the cube get an array type and not the 3D image? How is this cube array even supposed to be used? Is there even a cube array sampler?
Cube maps, cube map arrays, and 2D array textures are, in terms of the bits and bytes of storage, ultimately the same thing. All of these views are created from the same kind of image. You have to specify if you need a layered 2D image to be usable as an array or a cubemap (or both), but conceptually, they're all just the same thing.
Each mipmap level consists of L images of a size WxH, where W and H shrink based on the original size and the current mipmap level. L is the number of layers specified at image creation time, and it does not change with the mipmap level. Put simply, there are a constant number of 2D images per mipmap level. Cubemaps and cubemap arrays require L to be either 6 or a multiple of 6 respectively, but it's still constant.
A 3D image is not that. Each mipmap level consists of a single image of size WxHxD, where W, H, and D shrink based on the original size and current mipmap level. Even if you think of a mipmap level of a 3D image as being D number of WxH images, the number D is not constant between mipmap levels.
These are not the same things.
To have a 3D array image, you would need to have each mipmap level contain L 3D images of size WxHxD, where L is the same for each mipmap level.
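To make the difference concrete, here is a small Python sketch (illustrative arithmetic only, not Vulkan API code) that computes how many 2D slices each mip level holds for a layered 2D image versus a 3D image; the example sizes are assumptions chosen for clarity.

def mip_chain_2d_array(width, height, layers, levels):
    """Each mip level of a 2D array image holds `layers` WxH slices."""
    chain = []
    for level in range(levels):
        w = max(1, width >> level)
        h = max(1, height >> level)
        chain.append((w, h, layers))  # layer count never changes
    return chain

def mip_chain_3d(width, height, depth, levels):
    """Each mip level of a 3D image is a single WxHxD volume; D shrinks too."""
    chain = []
    for level in range(levels):
        w = max(1, width >> level)
        h = max(1, height >> level)
        d = max(1, depth >> level)
        chain.append((w, h, d))
    return chain

print(mip_chain_2d_array(256, 256, layers=4, levels=4))
# [(256, 256, 4), (128, 128, 4), (64, 64, 4), (32, 32, 4)]
print(mip_chain_3d(256, 256, 4, levels=4))
# [(256, 256, 4), (128, 128, 2), (64, 64, 1), (32, 32, 1)]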
As for the utility of a cubemap array, it's the same utility you would get out of a 2D array compared to a single 2D image. You use array textures when you need to specify one of a number of images to sample. It's just that in one case, each image is a single 2D image, while in another case, each image is a cubemap.
For a more specific example, many advanced forms of shadow mapping require the use of multiple shadow maps, selected at runtime. That's a good use case for an array texture. You can apply these techniques to point lights through the use of cube maps, but now you need to have the individual images in the array be cube maps, not just single 2D images.

TensorFlow: is the output label a value in the 2D grid, or a location in the grid?

My final output should be a 2D grid that contains values for each grid point. Is there a way to implement this in TensorFlow, where I can input a number of images and each image corresponds to a specific point in a 2D grid? I want my model to be such that when I input a similar image, it detects that specific grid cell in the 2D image. I mean that each input image belongs to a specific area in the output image (which I divided into a grid for simplicity, to make it a finite number of locations).

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
YOLO resizes the entire image; it does not extract annotated objects before resizing.
Since your input images have very high resolution, what you can do is:
Yolo can handle object sizes of 25 x 25 effectively with a network input layer size of 608 x 608. So if the object sizes in your original input image are greater than 250 x 250, you can train the images as they are (with a 608 x 608 network size). In that case, even when images are resized to the network size, objects will still be larger than 25 x 25. This should give you good accuracy.
(6000 / 608) * 25 ≈ 250, i.e. the downscale factor of the 6000 px side times the 25 px minimum object size.
If the object sizes in the original images are smaller than 200 x 200, split your input image into smaller tiles, say 2250 x 1500, and train these tiles as individual images. Each bigger image (9000 x 6000) then corresponds to 16 (4 x 4) training tiles. Each tile might contain zero to many objects. You can operate in a sliding-window fashion (a tiling sketch follows below).
The method you choose for training should be used for inference as well.
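As a rough illustration of the tiling approach (this is a Python sketch, not part of the original answer; the tile size, overlap, and the file name "big_image.jpg" are assumptions, and OpenCV is assumed to be available):

import cv2  # any image library that yields a numpy array works

def tile_image(image, tile_w=2250, tile_h=1500, overlap=0):
    """Split a large image (H x W x C array) into tiles.

    Returns a list of (x0, y0, tile) so box annotations can be shifted
    into each tile's local coordinates before YOLO training.
    """
    h, w = image.shape[:2]
    step_x = tile_w - overlap
    step_y = tile_h - overlap
    tiles = []
    for y0 in range(0, h, step_y):
        for x0 in range(0, w, step_x):
            tile = image[y0:y0 + tile_h, x0:x0 + tile_w]
            tiles.append((x0, y0, tile))
    return tiles

img = cv2.imread("big_image.jpg")      # e.g. a 9000 x 6000 image
tiles = tile_image(img, 2250, 1500)    # 16 tiles when there is no overlap

At inference time you would tile the input the same way, run detection per tile, and shift the detected boxes back into full-image coordinates.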
For training on objects of all sizes, use one of the following models (use this if you train on the original images as is):
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects that you want to detect are small, then for effective detection use Yolov4 with the following changes (use this if you split the original image into tiles):
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2 (this change is made in two places in the cfg)
References:
Refer to this relevant GitHub thread
darknet documentation

Use of base anchor size in Single Shot Multi-box detector

I was digging into the TensorFlow Object Detection API in order to check out the anchor box generation for the SSD architecture. In this py file, where the anchor boxes are generated on the fly, I am unable to understand the usage of base_anchor_size. In the corresponding paper, there is no mention of such a thing. Two questions in short:
What is the use of the base_anchor_size parameter? Is it important?
How does this parameter affect training in the case where the original input image is square and in the case where it isn't?
In the SSD architecture there are anchor scales that are fixed ahead of time, e.g. linear values across the range 0.2-0.9. These values are relative to the image size. For example, given a 320x320 image, the smallest anchor (with 1:1 ratio) will be 64x64, and the largest anchor will be 288x288. However, if you wish to feed your model a larger image, e.g. 640x640, without changing the anchor sizes (for example because these are images of far-away objects, so there's no need for large anchors; leaving the anchor sizes untouched also means you don't have to fine-tune the model for the new resolution), then you can simply set base_anchor_size=0.5, meaning the anchor scales would be 0.5*[0.2-0.9] relative to the input image size.
The default value for this parameter is [1.0, 1.0], meaning it has no effect.
The entries correspond to [height, width] relative to the maximal square you can fit in the image, meaning [min(image_height,image_width), min(image_height,image_width)]. So if, for example, your input image is VGA, i.e. 640x480, then the base_anchor_size is taken to be relative to [480, 480].
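To illustrate the arithmetic described above, here is a simplified Python sketch (not the actual TF Object Detection API code; the function name and the 1:1-ratio simplification are assumptions):

def anchor_sizes(image_height, image_width, scales, base_anchor_size=(1.0, 1.0)):
    """Rough sketch of how anchor sizes relate to scales and base_anchor_size.

    Sizes are taken relative to the largest square that fits in the image,
    i.e. min(image_height, image_width), as described above.
    """
    base = min(image_height, image_width)
    return [(s * base_anchor_size[0] * base,   # anchor height
             s * base_anchor_size[1] * base)   # anchor width (1:1 ratio)
            for s in scales]

scales = [0.2, 0.9]                      # smallest and largest SSD scales
print(anchor_sizes(320, 320, scales))    # [(64.0, 64.0), (288.0, 288.0)]
print(anchor_sizes(640, 640, scales, base_anchor_size=(0.5, 0.5)))
# [(64.0, 64.0), (288.0, 288.0)]  -> same pixel sizes on the larger image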

YOLO object detection model?

Currently I am reading the YOLO9000 paper (https://arxiv.org/pdf/1612.08242.pdf), and I am very confused about how the model can predict bounding boxes for object detection. I did many examples with TensorFlow, and in most of them we give the model images and the labels of the images.
My questions are:
1- How can we pass the bounding boxes instead of labels to the model?
2- How can the model learn that many boxes belong to one image?
In YOLO, we divide the image into a 7x7 grid. For each grid location, the network predicts three things:
Probability of an object being present in that grid cell
If an object lies in this grid cell, what are the coordinates of the bounding box?
If an object lies in this grid cell, which class does it belong to?
If we apply regression for all the above variables for all 49 grid locations, we will be able to tell which grid locations have objects (using the first parameter). For the grid locations that have objects, we can tell the bounding box coordinates and correct class using the second and third parameters.
Once we have designed a network that can output all the information we need, prepare the training data in this format, i.e. find these parameters for every 7x7 grid location in every image in your dataset. Next, you simply train the deep neural network to regress these parameters.
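As a hedged illustration of what "this format" could look like, here is a minimal Python sketch that builds a per-cell training target for a 7x7 grid with one box per cell; the 448x448 image size, the 20-class one-hot layout, and the function name are assumptions, not the exact YOLO training pipeline:

import numpy as np

def build_target(boxes, image_w, image_h, S=7, num_classes=20):
    """Toy sketch of a YOLO-style training target for an S x S grid.

    boxes: list of (class_id, x_center, y_center, w, h) in pixels.
    Each cell stores [objectness, x, y, w, h, one-hot class...],
    with coordinates normalized to the image size.
    """
    target = np.zeros((S, S, 5 + num_classes), dtype=np.float32)
    for class_id, xc, yc, w, h in boxes:
        col = min(int(xc / image_w * S), S - 1)   # which grid column
        row = min(int(yc / image_h * S), S - 1)   # which grid row
        target[row, col, 0] = 1.0                 # an object is present here
        target[row, col, 1:5] = [xc / image_w, yc / image_h,
                                 w / image_w, h / image_h]
        target[row, col, 5 + class_id] = 1.0      # class one-hot
    return target

t = build_target([(3, 200, 150, 80, 60)], image_w=448, image_h=448)
print(t.shape)  # (7, 7, 25)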
To pass bounding boxes of an image, we need to create them first. You can create bounding boxes for any image using annotation tools. Here, you have to draw a boundary that bounds an object and then label that bounding box/rectangle. You need to do this for every object in the image that you want your model to train on/recognize.
There is one very useful project at this link; you should check it out if you need to understand bounding boxes.
I have just started learning object detection with TensorFlow, so as and when I get proper info on providing bounding boxes to the object detection model, I'll also update it here. Also, if you have solved this problem by now, you could provide the details to help out others facing the same kind of problem.
1- How can we pass the bounding boxes instead of labels to the model?
If we want to train a model that performs object detection (not object classification), we have to pass the ground-truth labels as, for example, .xml files. An xml file contains information about the objects that exist in an image. The information about each object is composed of 5 values (a small parsing sketch follows this list):
class name of this object, such as car or human...
xmin: x coordinate of the box's top left point
ymin: y coordinate of the box's top left point
xmax: x coordinate of the box's bottom right point
ymax: y coordinate of the box's bottom right point
One bounding box within an image is specified as a set of 5 values like the above. If there are 3 objects in an image, the xml file will contain 3 sets of these values.
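For illustration, an annotation of this kind can be read with a few lines of Python; the tag names below assume the Pascal VOC format (object, name, bndbox, xmin, ...), and other annotation tools may use a different layout:

import xml.etree.ElementTree as ET

def read_voc_annotation(xml_path):
    """Extract (class_name, xmin, ymin, xmax, ymax) for every object
    listed in a Pascal VOC-style .xml annotation file."""
    root = ET.parse(xml_path).getroot()
    objects = []
    for obj in root.findall("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        objects.append((
            name,
            int(float(box.find("xmin").text)),
            int(float(box.find("ymin").text)),
            int(float(box.find("xmax").text)),
            int(float(box.find("ymax").text)),
        ))
    return objects

# e.g. read_voc_annotation("image_001.xml") -> [("car", 48, 240, 195, 371), ...]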
2- How can the model learn that many boxes belong to one image?
As you know, the output of YOLOv2 or YOLO9000 has shape (13, 13, D), where D depends on how many classes of objects you're going to detect. You can see that there are 13x13 = 169 cells (grid cells) and each cell has D values (depth).
Among the 169 grid cells, some grid cells are responsible for predicting bounding boxes. If the center of a true bounding box falls in a grid cell, that grid cell is responsible for predicting that bounding box when the model is given that image.
I think there must be a function that reads the xml annotation files and determines which grid cells are responsible for detecting the bounding boxes.
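The cell assignment itself is a small calculation; here is a minimal Python sketch, assuming a 416x416 input and the resulting 13x13 grid (the function name and pixel values are illustrative):

def responsible_cell(x_center, y_center, image_w, image_h, grid=13):
    """Which grid cell a ground-truth box is assigned to: the cell
    that contains the box center (coordinates in pixels)."""
    col = min(int(x_center / image_w * grid), grid - 1)
    row = min(int(y_center / image_h * grid), grid - 1)
    return row, col

print(responsible_cell(208, 104, image_w=416, image_h=416))  # (3, 6)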
To make the model learn the box positions and shapes, not only the classes, we have to build an appropriate loss function. The loss function used in YOLOv2 also puts a cost on the box shapes and positions. So the loss is calculated as the weighted sum of the following individual loss terms (a small sketch follows this list):
Loss on the class name
Loss on the box position (x-y coordinates)
Loss on the box shape (box width and height)
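Here is a toy Python sketch of such a weighted sum over the responsible cells only; it is a simplified stand-in, not the exact YOLOv2 loss (which also has an objectness/confidence term and uses the square roots of width and height), and the weights and tensor layout are assumptions:

import numpy as np

def yolo_like_loss(pred, truth, obj_mask,
                   w_class=1.0, w_coord=5.0, w_shape=5.0):
    """Weighted sum of position, shape, and class losses over cells
    where obj_mask == 1. Per-cell layout: [x, y, w, h, class scores...]."""
    coord_loss = np.sum(obj_mask * np.sum((pred[..., 0:2] - truth[..., 0:2]) ** 2, axis=-1))
    shape_loss = np.sum(obj_mask * np.sum((pred[..., 2:4] - truth[..., 2:4]) ** 2, axis=-1))
    class_loss = np.sum(obj_mask * np.sum((pred[..., 4:] - truth[..., 4:]) ** 2, axis=-1))
    return w_coord * coord_loss + w_shape * shape_loss + w_class * class_loss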
SIDE NOTE:
Actually, one grid cell can detect up to B boxes, where B depends on the implementation of YOLOv2. I used darkflow to train YOLOv2 on my custom training data, in which B was 5. So the model can detect 169*B boxes in total, and the loss is the sum of 169*B small losses.
D = B*(5+C), where C is the number of classes you want to detect.
Before being passed to the model, the box shapes and positions are converted into values relative to the image size.
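As a small worked example of the two statements above (a Python sketch with assumed values, not code from YOLOv2 or darkflow):

def output_depth(B, C):
    """Depth of the 13x13 output: B boxes per cell, each with
    (x, y, w, h, objectness) plus C class scores."""
    return B * (5 + C)

def to_relative(xc, yc, w, h, image_w, image_h):
    """Convert a box from pixel units to values relative to the image size."""
    return xc / image_w, yc / image_h, w / image_w, h / image_h

print(output_depth(B=5, C=20))                  # 125, so the output is (13, 13, 125)
print(to_relative(208, 104, 52, 26, 416, 416))  # (0.5, 0.25, 0.125, 0.0625)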