Input feature to Feature maps - tensorflow

Can anybody please explain this basic thing to me: how does a 192x28x28 input get reduced to 16x28x28 feature maps using a 1x1 conv mapping? My question is about understanding what exactly happens when 192 goes to 16.
I know about the output-size formula ((I - F + 2P)/S) + 1, but what happens in the process of reducing the depth?

A 1x1 convolution with a single kernel compresses the whole 192*28*28 input (which can be read as 192 feature maps of 28px * 28px) into a single 1*28*28 map: at every pixel it computes a weighted sum across all 192 channels. So far it reduces the depth in the "feature map axis" to 1 while preserving the height and width of the original image.
But then... why do you get 16? In a convolutional layer you can have several kernels. Each kernel is an independent filter of the same size. In your case the 1x1 conv layer is configured with 16 kernels, hence you get 16 maps of 28*28 (one per kernel).
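As a minimal sketch (using tf.keras; the tensor shapes are illustrative, with channels last as is TensorFlow's default), a 1x1 convolution with 16 filters maps 192 input channels to 16 output channels while leaving the 28x28 spatial dimensions untouched:

import tensorflow as tf

x = tf.random.normal([1, 28, 28, 192])                  # one image, 192 channels
conv_1x1 = tf.keras.layers.Conv2D(filters=16, kernel_size=1)
y = conv_1x1(x)
print(y.shape)                                          # (1, 28, 28, 16)

Each of the 16 filters holds 192 weights (plus a bias), so every output channel is a learned linear combination of the 192 input channels at that pixel.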

Related

Understanding 2D convolution output size

I am a beginner in convolutional deep learning. I saw the following architecture in the paper Simultaneous Feature Learning and Hash Coding with Deep Neural Networks: for images of size 256*256,
I do not understand the output size of the first 2D convolution: 96*54*54. 96 seems fine, as the number of filters is 96. But if we apply the formula for the output size, size = [(W − K + 2P)/S] + 1 = [(256 - 11 + 2*0)/4] + 1 = 62.25 ~ 62. I have assumed the padding P to be 0, as it is not mentioned anywhere in the paper. The Keras Conv2D API produces the same 96*62*62 output size. Then why does the paper report 96*54*54? What am I missing?
Well, it reminded me of the AlexNet paper, where there was a similar mistake. Your calculation is correct. I think they mistakenly wrote 256x256 instead of 224x224, in which case the calculation for the input layer is
(224 - 11 + 2*0)/4 + 1 = 54.25 ~ 54
It is quite possible that the authors mistakenly wrote 256x256 while the real architecture input size was 224x224 (that was also the case in AlexNet). The other, less likely option is that 256x256 really was the architecture's input size but they did the calculations for 224x224; I consider that ignorable, since it would be a very silly mistake.
Thus, I believe the true input size was 224x224 instead of 256x256.
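The arithmetic is easy to reproduce with a small helper (a sketch of the formula above, not code from the paper or from Keras):

def conv_output_size(w, k, s, p=0):
    # floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

print(conv_output_size(256, 11, 4))   # 62 -> matches the Keras result
print(conv_output_size(224, 11, 4))   # 54 -> matches the 96*54*54 in the paper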

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it 'extract' the annotated object first before resizing?
Yes, the entire image is resized in the case of YOLO; it does not extract the annotated objects before resizing.
Since your input images have very high resolution, what you can do is:
YOLO can handle object sizes of about 25 x 25 effectively with a network input size of 608 x 608. So if your object sizes in the original input image are greater than about 250 x 250, you can train the images as they are (with a 608 x 608 network size). In that case, even after the images are resized to the network size, the objects will still be larger than 25 x 25. This should give you good accuracy.
(6000 / 608) * 25 ≈ 250
If the object sizes in the original images are smaller than about 200 x 200, split your input image into 8 smaller units/blocks, say a 4 x 2 grid of 2250 x 3000 tiles. Train these blocks as individual images; each bigger image (9000 x 6000) then corresponds to 8 training images, and each tile might contain anywhere from zero to many objects. You can operate in a sliding-window manner.
The method you choose for training should be used for inference as well.
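A rough sketch of the tiling idea (not part of the darknet tooling; the 4 x 2 grid and the use of Pillow are assumptions, and the annotation boxes would still have to be remapped into tile coordinates):

from PIL import Image

def split_into_tiles(path, cols=4, rows=2):
    img = Image.open(path)                     # e.g. 9000 x 6000
    w, h = img.size
    tw, th = w // cols, h // rows              # 2250 x 3000 with the defaults
    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tw, r * th, (c + 1) * tw, (r + 1) * th)
            tiles.append(img.crop(box))        # one training image per tile
    return tiles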
For training on objects of all sizes, use one of the following models [use this if you train on the original images as they are]:
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects you want to detect are small, then for effective detection use Yolov4 with the following changes [use this if you split the original image into 8 blocks]:
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2 (this change appears in two places in the cfg)
References:
Refer to this relevant GitHub thread
darknet documentation

Tuning first_stage_anchor_generator in faster rcnn model

I am trying to detect some very small objects (~25x25 pixels) in large images (~2040 x 1536 pixels) using the Faster R-CNN model from the object detection API here: https://github.com/tensorflow/models/tree/master/research/object_detection
I am very confused about the following configuration parameters (I have read the proto file and have also tried modifying them and testing):
first_stage_anchor_generator {
  grid_anchor_generator {
    scales: [0.25, 0.5, 1.0, 2.0]
    aspect_ratios: [0.5, 1.0, 2.0]
    height_stride: 16
    width_stride: 16
  }
}
I am quite new to this area; if someone could explain these parameters a bit, it would be much appreciated.
My question is: how should I adjust the above (or other) parameters to account for the fact that I have very small, fixed-size objects to detect in a large image?
Thanks
I don't know the actual answer, but I suspect that the way Faster RCNN works in Tensorflow object detection is as follows:
this article says:
"Anchors play an important role in Faster R-CNN. An anchor is a box. In the default configuration of Faster R-CNN, there are 9 anchors at a position of an image. The following graph shows 9 anchors at the position (320, 320) of an image with size (600, 800)."
The author gives an image showing overlapping boxes; those are candidate regions that could contain an object. Next comes the "R" part of the "RCNN" model, the region proposal: another neural network, trained alongside the CNN, that figures out the best-fitting box. There are a lot of "proposals" for where an object could be based on all the boxes, but we still don't know where it actually is.
This "region proposal" neural net's job is to find the correct region and it is trained based on the labels you provide with the coordinates of each object in the image.
Looking at this file, I noticed:
line 174: heights = scales / ratio_sqrts * base_anchor_size[0]
line 175: widths = scales * ratio_sqrts * base_anchor_size[1]
which seem to be the end result of the configuration found in the config file: generating a list of anchor boxes with known widths and heights to slide over the image. The base_anchor_size defaults to [256, 256]. In the comments, the author of the code wrote:
"For example, setting scales=[.1, .2, .2]
and aspect ratios = [2,2,1/2] means that we create three boxes: one with scale
.1, aspect ratio 2, one with scale .2, aspect ratio 2, and one with scale .2
and aspect ratio 1/2. Each box is multiplied by "base_anchor_size" before
placing it over its respective center."
This gives insight into how the boxes are created: the code builds a list of boxes from the scales = [...] and aspect_ratios = [...] parameters, and those boxes are then slid over the image. The scale is fairly straightforward: it is how much the default 256 by 256 square box is scaled before it is used. The aspect ratio is what turns the original square box into a rectangle that is closer to the (scaled) shape of the objects you expect to encounter.
Meaning, to configure the scales and aspect ratios well, you should find the "typical" sizes of the objects in your images (e.g. 20 by 30, 5 by 10, etc.), work out how much the default 256 by 256 square box should be scaled to fit them optimally, then find the "typical" aspect ratios of your objects (the aspect ratio being the ratio of the width to the height) and set those as your aspect ratio parameters.
Note: it seems that the number of elements in the scales and aspect_ratios lists in the config file should be the same but I don't know for sure.
Also, I am not sure how to find the optimal stride, but if your objects are smaller than 16 by 16 pixels, the sliding window you created by setting the scales and aspect ratios might just skip your object altogether.
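For intuition, here is a rough NumPy reproduction of the two quoted lines, using the scales and aspect ratios from the question and the default base_anchor_size of [256, 256] (the exhaustive pairing of every scale with every aspect ratio is my assumption about how the grid anchor generator combines them):

import numpy as np

scales = [0.25, 0.5, 1.0, 2.0]
aspect_ratios = [0.5, 1.0, 2.0]
base_anchor_size = [256.0, 256.0]

for scale in scales:
    for ratio in aspect_ratios:
        ratio_sqrt = np.sqrt(ratio)
        height = scale / ratio_sqrt * base_anchor_size[0]   # line 174
        width = scale * ratio_sqrt * base_anchor_size[1]    # line 175
        print(f"scale={scale}, ratio={ratio} -> {width:.0f} x {height:.0f} (w x h)")

With scale 0.25 and aspect ratio 1.0, for example, the anchor is 64 x 64 pixels, which is still much larger than a 25 x 25 object; that is the kind of mismatch the scales are meant to fix.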
As far as I know, proposal anchors are generated only for Faster R-CNN model types. This file specifies which parameters may be set for anchor generation through the config lines you mentioned.
I tried setting base_anchor_size, but it did not work for me. However, this FasterRCNNTutorial mentions that:
[...] you also need to configure the anchor sizes and aspect ratios in the .config file. The base anchor size is 255,255.
The anchor ratios will multiply the x dimension and divide the y dimension, so if you have an aspect ratio of 0.5 your 255x255 anchor becomes 128x510. Each aspect ratio in the list is applied, then the results are multiplied by the scales. So the first step is to resize your images to the training/testing size, then manually check what the smallest and largest objects you expect are, and what the most extreme aspect ratios will be. Set up the config file with values that will cover these cases when the base anchor size is adjusted by the aspect ratios and multiplied by the scales.
I think it's pretty straightforward. I also used this 'workaround'.

Darknet YOLO image size

I am trying to train a custom object classifier in Darknet YOLO v2
https://pjreddie.com/darknet/yolo/
I gathered a dataset of images; most of them are 6000 x 4000 px, with some at lower resolutions as well.
Do I need to resize the images to be square before training?
I found that the config uses:
[net]
batch=64
subdivisions=8
height=416
width=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1
That's why I was wondering how to use it with datasets of different image sizes.
You don't have to resize the images, because Darknet will do it for you!
That means you really don't need to do it yourself, and you can use different image sizes during training. What you posted above is just part of the network configuration; there should be a full network definition as well. The height and the width tell you the network resolution. Darknet also keeps the aspect ratio, see e.g. this.
You don't need to resize your dataset images. PJReddie's YOLO architecture does it by itself, keeping the aspect ratio intact (no information is lost), according to the resolution in the .cfg file.
For example, if you have an image of size 1248 x 936, YOLO will resize it to 416 x 312 and then pad the extra space with black bars to fit it into the 416 x 416 network.
It is very common to resize images before training. 416x416 is slightly larger than common; most ImageNet models resize and square the images to 256x256, for example, so I would expect the same here. Trying to train on 6000x4000 would require a farm of GPUs. The standard process is to square the image to the largest dimension (height or width), padding with 0's on the shorter side, and then to resize using standard image tools like PIL.
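A minimal sketch of that pad-then-resize (letterbox) step with Pillow; centering the image on the black canvas is my own choice, not something darknet prescribes:

from PIL import Image

def letterbox(path, target=416):
    img = Image.open(path)
    w, h = img.size
    side = max(w, h)
    canvas = Image.new("RGB", (side, side), (0, 0, 0))      # pad with black
    canvas.paste(img, ((side - w) // 2, (side - h) // 2))   # center the image
    return canvas.resize((target, target))                  # e.g. 416 x 416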
You do not need to resize the images; you can directly change the values in the darknet .cfg file.
When you open the darknet .cfg (yolo-darknet.cfg) file, you can see all the hyper-parameters and their values.
As shown in your cfg file, the image dimensions are (416, 416) -> (width, height); you can change these values, and darknet will automatically resize the images before training.
Since your images have high dimensions, you can also adjust the batch and subdivisions values (lower the values: 32, 16, 8; they have to be multiples of 2) so that darknet does not crash with a memory allocation error.
By default the darknet API resizes the images for both inference and training, but in theory any input size w, h = 32 * X, where X is a natural number, should work (w is the width, h the height). By default X = 13, so the input size is w, h = (416, 416). I use this rule with yolov3 in OpenCV, and it works better the bigger X is.
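A tiny illustration of that rule (the rounding helper is my own convenience, not part of darknet or OpenCV):

def nearest_valid_size(size, stride=32):
    # round up to the next multiple of 32
    return ((size + stride - 1) // stride) * stride

print(nearest_valid_size(416))   # 416 (X = 13)
print(nearest_valid_size(600))   # 608 (X = 19)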

conv2d on non-rectangular image in Tensorflow

I have a dataset of images which are half black in an upper-triangular fashion, i.e. all pixels below the main diagonal are black.
Is there a way in Tensorflow to give such an image to a conv2d layer and mask or limit the convolution to only the relevant pixels?
If black translates to 0, then you don't need to do anything: the convolution multiplies the 0 by whatever weight it has, so it will not contribute to the result. If it doesn't, you can multiply the data by a binary mask to make those pixels 0.
For all-black pixels you will still get the bias term, if you have one.
You could also multiply the result by a binary mask to zero out the areas you don't want populated. That way you can also decide to drop outputs that cover too many black cells, e.g. around the diagonal.
You could write your own custom operation that does what you want, but I would recommend against it: you would get a speedup of at most 2x, and the other operations would lower it further. You will probably gain more performance by running on a GPU.
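A minimal sketch of the masking idea in TensorFlow (the 28x28 size, the single input channel, and masking both the input and the output are illustrative assumptions):

import numpy as np
import tensorflow as tf

h = w = 28
mask = np.triu(np.ones((h, w), dtype="float32"))            # 1 on and above the diagonal
mask = tf.constant(mask)[tf.newaxis, :, :, tf.newaxis]      # shape [1, H, W, 1]

x = tf.random.normal([1, h, w, 1]) * mask                   # force the lower triangle to 0
conv = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding="same")
y = conv(x) * mask                                          # zero out unwanted output areas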