How are the bounding boxes generated in YOLO v1?

I wonder how the YOLOv1 network produces its first two bounding boxes.
Are they from the pretrained network, or are they just randomly generated?
Many explanations only cover how the box with the higher IoU is chosen.
I've seen another comment explaining that each of the two bounding boxes comes from one of the two fully connected layers, and I get that.
I also know that YOLOv2 uses k-means clustering to make its bounding box priors, and that SSD has its own formula for generating its anchor boxes.
I would be grateful if someone could give me a hint.
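For what it's worth, in YOLO v1 the two boxes per cell are not anchors at all: they are just ten of the regression values emitted by the final fully connected layer, reshaped into an S x S x (B*5 + C) tensor (7 x 7 x 30 in the paper). A minimal sketch of that reshaping, with illustrative stand-in values:

```python
import numpy as np

S, B, C = 7, 2, 20   # grid size, boxes per cell, classes (the YOLO v1 paper's values)

# Stand-in for the activations of the network's last fully connected layer.
fc_output = np.random.randn(S * S * (B * 5 + C))

# The flat FC output is simply reshaped into the 7x7x30 prediction tensor.
preds = fc_output.reshape(S, S, B * 5 + C)

cell = preds[3, 3]                    # the predictions of one grid cell
boxes = cell[:B * 5].reshape(B, 5)    # two boxes: (x, y, w, h, confidence) each
class_probs = cell[B * 5:]            # 20 conditional class probabilities

# During training, the box with the higher IoU against the ground truth is made
# "responsible" for the object; neither box comes from a pretrained or fixed prior.
```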

Related

What should the output layer of a deep learning network look like for multi-object bounding box regression?

I am building a neural network on the back of Mobilenet SSD v2, specifically for bounding box regression. I have had a difficult time finding clear resources on how the output of the model should be shaped. My data generally has 1-4 boxes present in any given image, so I could simply concatenate the coordinates so that the output is Dense(16), but what about the case when there are more than 4 objects present in the image? I am unsure how to handle a dynamic multi-object output layer. How can I do this, and are there any detailed resources that can be shared?
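One common workaround, sketched below under assumptions not taken from the post (the slot count, input size and backbone are illustrative), is to fix a maximum number of box slots and predict an objectness score per slot, so images with fewer objects simply leave the remaining slots empty:

```python
import tensorflow as tf

MAX_BOXES = 10   # assumed upper bound on objects per image

# Illustrative backbone; in practice this would be the MobileNet-SSD feature extractor.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights=None, pooling="avg")

inputs = tf.keras.Input(shape=(224, 224, 3))
features = backbone(inputs)

# Each slot predicts (x_min, y_min, x_max, y_max, objectness).
# Images with fewer objects train the unused slots toward objectness = 0,
# and their coordinate loss is masked out.
out = tf.keras.layers.Dense(MAX_BOXES * 5)(features)
out = tf.keras.layers.Reshape((MAX_BOXES, 5))(out)

model = tf.keras.Model(inputs, out)
model.summary()
```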

How do I train a CNN to learn bounding boxes from labeled images?

I am trying to detect the faces of specific people in images. Two people. I have 2k labelled images of each of the two people. These are normal snapshots, so there are other people in the images. In some cases, both people appear in the same image.
I used Adobe Lightroom's face detection to label the images. In retrospect, this was a mistake; it's very limited. For example, after labelling enough images, LR guesses at the labels that should be applied. These guesses are rather good. However, you must confirm each guess before it can be used to select photos.
Since I already have a substantial corpus of labeled images, I hoped there was a way I could learn the bounding boxes from the labeled images, rather than use something like labelImg to manually redo work I have already performed.
Ideally, I'm looking for a TensorFlow model that I can load and run on the labelled images where the output is the bounding box. If this is a fool's errand, I would also like to know that.
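As an aside (this is an assumption on my part, not something from the post): one way to avoid re-labelling by hand is to run an off-the-shelf face detector over the already-labelled snapshots and keep its detections as the box annotations, pairing each box with the person label you already have. OpenCV ships a pretrained Haar-cascade face detector, for example:

```python
import cv2

# Pretrained frontal-face Haar cascade that ships with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("snapshot.jpg")   # one of the labelled snapshots (path is illustrative)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Each detection is an (x, y, w, h) rectangle in pixel coordinates.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    print("face box:", x, y, x + w, y + h)
```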

How to get a bounding box for each symbol on a number plate

I want to train some neural network to detect symbols on a car license plate.
I have 10k pictures with plates, and 10k strings containing the text shown on each plate. For example, this picture has the name "В394ТТ64.png" (the other pictures have roughly the same quality and size, but different shadows, contrast, lighting and so on).
So, what do I want to do?
I want to automatically create PASCAL VOC XML files containing information about each symbol on a plate. Then I want to train a neural network to detect the symbols and their classes. I already know which symbols appear in each picture, but I don't know how to get the bounding box coordinates.
I tried to use OpenCV and binary segmentation, but the lighting, shadows, size and noise in the pictures vary too much.
I also tried to find trained neural networks that can detect symbols, or to train one myself, but failed.
So, how can I get a bounding box for each symbol on a license plate?
There are multiple methods to do this.
Mainly, you will have to go over your image and do object detection on each segment of the image.
In your case that should be easier, since the plate is already a defined area. Probably move from left to right in strides.
Using an MNIST-trained classifier, you can classify the symbol in each image part. If you get a result with p of, e.g., 90%, you take the coordinates of that part of the image as your bounding box coordinates.
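A rough sketch of that left-to-right idea; the classifier, window width, stride and threshold are all placeholders rather than anything specified above:

```python
import cv2

def find_symbol_boxes(plate_img, classifier, win_w=32, stride=8, threshold=0.9):
    """Slide a window across a grayscale plate image and keep confident windows."""
    h, w = plate_img.shape[:2]
    boxes = []
    for x in range(0, w - win_w + 1, stride):
        crop = cv2.resize(plate_img[:, x:x + win_w], (28, 28))   # match the classifier input
        probs = classifier.predict(crop[None, :, :, None] / 255.0, verbose=0)[0]
        if probs.max() >= threshold:   # e.g. p >= 90% as suggested above
            boxes.append((x, 0, x + win_w, h, int(probs.argmax())))
    return boxes   # (x_min, y_min, x_max, y_max, predicted class index)
```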
You can of course also reuse known architectures such as R-CNN or YOLO.
Here you can find a nice overview.
Good luck
I found another way to solve this problem.
I wrote a script that generates different images with number plates, along with an XML file for each image. I generated 10k images.
Then I augmented them so they look more like "real world" images. Now I have 14k images: 4k from the original set and 10k augmented.
I trained an ssd_mobilenet model.
Afterwards, I used auto-annotation to detect boxes on real images.
Then I trained the model one more time, and that's it.
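For the annotation side of that approach, here is a minimal sketch of writing one PASCAL VOC XML file per generated plate; the file names, sizes and boxes are illustrative, since the generator knows exactly where it drew each character:

```python
import xml.etree.ElementTree as ET

def write_voc_xml(img_name, img_w, img_h, symbols, out_path):
    """symbols: list of (label, xmin, ymin, xmax, ymax) known from the plate generator."""
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = img_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(img_w)
    ET.SubElement(size, "height").text = str(img_h)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in symbols:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        box = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), (xmin, ymin, xmax, ymax)):
            ET.SubElement(box, tag).text = str(val)
    ET.ElementTree(root).write(out_path)

# Example: the generator records where it drew each character on the plate.
write_voc_xml("plate_0001.png", 160, 40,
              [("B", 5, 8, 20, 32), ("3", 22, 8, 37, 32)], "plate_0001.xml")
```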

Counting Pedestrians Using TensorFlow's Object Detection

I am new to the machine learning field and, based on what I have seen on YouTube and read on the internet, I conjectured that it might be possible to count pedestrians in a video using TensorFlow's Object Detection API.
Consequently, I did some research on TensorFlow, read the documentation on how to install it, and finally downloaded and installed it. Using the sample files provided on GitHub, I adapted the code from the object_detection notebook provided here: https://github.com/tensorflow/models/tree/master/research/object_detection.
I executed the adapted code on the videos that I collected, while making changes to the visualization_utils.py script so that it reports the number of objects that cross a defined region of interest on the screen. That is, I collected the bounding box dimensions (left, right, top, bottom) of the person class and counted all the detections that crossed the defined region of interest (imagine a set of two virtual vertical lines on the video frame with left and right pixel values, and then comparing each detected bounding box's left and right values with those predefined values).
However, when I use this procedure I miss a lot of pedestrians even though they are detected by the program. That is, the program correctly classifies them as persons, but sometimes they don't meet the criteria that I defined for counting, and so they are not counted. I want to know if there is a better way of counting unique pedestrians than the simplistic method I am trying to develop. Is the approach that I am using the right one? Could there be other, better approaches? I would appreciate any kind of help.
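For reference, a bare-bones version of the crossing check described above; the line positions and the example boxes are placeholders:

```python
# Illustrative x-positions of the two virtual vertical lines on the frame.
LINE_LEFT, LINE_RIGHT = 300, 320

def crosses_roi(box):
    """box = (left, right, top, bottom) in pixels for one detected person."""
    left, right, top, bottom = box
    # The detection counts if the box overlaps the band between the two lines.
    return left <= LINE_RIGHT and right >= LINE_LEFT

# Hypothetical detections from two frames: (left, right, top, bottom).
frames = [[(290, 330, 100, 400), (10, 60, 120, 380)],
          [(305, 345, 100, 400)]]

count = sum(crosses_roi(box) for boxes in frames for box in boxes)
print(count)  # counts detections, not unique pedestrians, which is where the miscounting comes from
```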
Please go easy on me as I am not a machine learning expert and just a novice.
You are using a pretrained model which is trained to identify people in general. I think you're saying that some people are pedestrians whereas other people are not; for example, someone standing waiting at the light is a pedestrian, but someone standing in their garden behind the street is not a pedestrian.
If I'm right, then you've reached the limitations of what you'll get with this model and you will probably have to train a model yourself to do what you want.
Since you're new to ML, building your own dataset and training your own model probably sounds like a tall order; there's a learning curve, to be sure. So I'll suggest the easiest way forward: use the object detection model to identify people, then train a new binary classification model (about the easiest model to train) to identify whether a particular person is a pedestrian or not (you will create a dataset of images and 1/0 values identifying them as pedestrian or not). I suggest this because a boolean classification model is about as easy a model as you can get, and there are dozens of tutorials you can follow. Here's a good one:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/3_NeuralNetworks/neural_network.ipynb
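To make that concrete, here is a minimal sketch of such a binary classifier in Keras; the backbone, input size and head are my assumptions, not taken from the post or the tutorial above:

```python
import tensorflow as tf

# ImageNet-pretrained backbone, frozen; only the small head is trained.
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(160, 160, 3), include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # 1 = pedestrian, 0 = not a pedestrian
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds would yield (cropped person image, 0/1 label) pairs built from the detector's boxes.
# model.fit(train_ds, epochs=10)
```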
A few things to note when doing this:
When you build your dataset you will want a set of images, at least a few thousand, along with the 1/0 classification for each (pedestrian or not pedestrian).
You will get much better results if you start with a model that is pretrained on ImageNet than if you train it from scratch (though this might be a reasonable step 2, as it's an extra task), especially if you only have a few thousand images to train on.
Since your images will have multiple people in them, you have the problem of identifying which person you want the model to classify as a pedestrian or not. There's no single right way to do this. If you draw a yellow box around the target person, the network may be able to learn this notation. Another valid approach might be to remove the other detected people from the image by deleting them and leaving those areas black. Centering the crop on the target person may also be reasonable.
My last bullet point illustrates a problem with the idea as I've proposed it. The best solution would be to alter the object detection network to output both a bounding box per person and a pedestrian/non-pedestrian classification with it, or to train the model to identify only pedestrians in the first place. I mention this as more optimal, but I consider it a more advanced task than my first suggestion, with a more complex dataset to manage. It's probably not the first thing you want to tackle as you learn your way around ML.

Why "softmax_cross_entropy_with_logits_v2" backprops into labels

I am wondering why, in TensorFlow version 1.5.0 and later, softmax_cross_entropy_with_logits_v2 defaults to backpropagating into both labels and logits. What are some applications/scenarios where you would want to backprop into the labels?
I saw the GitHub issue below asking the same question; you might want to follow it for future updates.
https://github.com/tensorflow/minigo/issues/37
I don't speak for the developers who made this decision, but I would surmise that they made it the default because it is indeed used often, and for most applications where you aren't backpropagating into the labels, the labels are constant anyway and won't be adversely affected.
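To see the default concretely, here is a small sketch in TF 2 eager style (where the v2 op became tf.nn.softmax_cross_entropy_with_logits); the same behaviour applies to the 1.5 op, and wrapping the labels in tf.stop_gradient restores the old behaviour:

```python
import tensorflow as tf

logits = tf.constant([[2.0, 0.5, -1.0]])
labels = tf.constant([[0.7, 0.2, 0.1]])   # soft labels, so a gradient w.r.t. them is meaningful

with tf.GradientTape() as tape:
    tape.watch([logits, labels])
    loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)

d_logits, d_labels = tape.gradient(loss, [logits, labels])
print(d_labels)   # non-None: the op backprops into the labels by default

# To disable this, pass the labels through tf.stop_gradient first:
with tf.GradientTape() as tape:
    tape.watch([logits, labels])
    loss = tf.nn.softmax_cross_entropy_with_logits(
        labels=tf.stop_gradient(labels), logits=logits)
print(tape.gradient(loss, labels))   # None: no gradient reaches the labels
```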
Two common use cases for backpropagating into the labels are:
Creating adversarial examples
There is a whole field of study around building adversarial examples that fool a neural network. Many of the approaches involve training a network, then holding the network fixed and backpropagating into the labels (the original image) to tweak it (usually under some constraints) to produce a result that fools the network into misclassifying the image.
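Sketching that idea under the assumption of a generic Keras classifier (the model, label format and step size are placeholders, not something from the answer):

```python
import tensorflow as tf

def fgsm_step(model, image, true_label, eps=0.01):
    """One signed-gradient step on the image that increases the loss for the true label."""
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)
        logits = model(image, training=False)
        loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=true_label, logits=logits)
    grad = tape.gradient(loss, image)    # backprop all the way into the input image
    return image + eps * tf.sign(grad)   # nudge the image to push the loss up
```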
Visualizing the internals of a neural network.
I also recommend watching the deepviz toolkit video on YouTube; you'll learn a ton about the internal representations learned by a neural network.
https://www.youtube.com/watch?v=AgkfIQ4IGaM
If you continue digging into that and find the original paper, you'll find that they also backpropagate into the labels to generate images that highly activate certain filters in the network, in order to understand them.