Why "softmax_cross_entropy_with_logits_v2" backprops into labels - tensorflow

I am wondering why in Tensorflow version 1.5.0 and later, softmax_cross_entropy_with_logits_v2 defaults to backpropagating into both labels and logits. What are some applications/scenarios where you would want to backprop into labels?

I saw the github issue below asking the same question, you might want to follow it for future updates.
https://github.com/tensorflow/minigo/issues/37
I don't speak for the developers who made this decision, but I would surmise that they would do this by default because it is indeed used often, and for most application where you aren't backpropagating into the labels, the labels are a constant anyway and won't be adversely affected.
Two common uses cases for backpropagating into labels are:
Creating adversarial examples
There is a whole field of study around building adversarial examples that fool a neural network. Many of the approaches used to do so involve training a network, then holding the network fixed and backpropagating into the labels (original image) to tweak it (under some constraints usually) to produce a result that fools the network into misclassifying the image.
Visualizing the internals of a neural network.
I also recommend people watch the deepviz toolkit video on youtube, you'll learn a ton about the internal representations learned by a neural network.
https://www.youtube.com/watch?v=AgkfIQ4IGaM
If you continue digging into that and find the original paper you'll find that they also backpropagate into the labels to generate images which highly activate certain filters in the network in order to understand them.

Related

How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I for example download a version of MobileNet and use it to train on, lets say, red and green apples. Does it forget all the things that is has already been trained on? And if so, why does it then benefit to use MobileNet over building a CNN from scratch.
Thanks for any answers!
Does it forget all the things that is has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes it will "forget" the old classes. However, the old pre-training can help learning the new classes, this is a training strategy called "transfert learning" of "fine tuning" depending on the exact approach.
As a rule of thumb it is generally not a good idea to create a new network architecture from scratch as better networks probably already exist. You may want to implement your custom architecture if:
You are learning CNN's and deep learning
You have a specific need and you proved that other architectures won't fit or will perform poorly
Usually, one take an existing pre-trained network and specialize it for their specific task using transfert learning.
A lot of scientific literature is available for free online if you want to learn. you can start with the Yolo series and R-CNN, Fast-RCNN and Faster-RCNN for detection networks.
The main concept behind object detection is that it divides the input image in a grid of N patches, and then for each patch, it generates a set of sub-patches with different aspect ratios, let's say it generates M rectangular sub-patches. In total you need to classify MxN images.
In general the idea is then analyze each sub-patch within each patch . You pass the sub-patch to the classifier in your model and depending on the model training, it will classify it as containing a green apple/red apple/nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the object detected.
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (Of course, the more sub-patches, the slower your model will be) and,
The classifier. The classifier is normally an already exisiting network (MobileNeet, VGG, ResNet...). This part is commonly used as the "backbone" and it will extract the features of the input image. With the classifier you can either choose to training it "from zero", therefore your weights will be adjusted to your specific problem, OR, you can load the weigths from other known problem and use them in your problem so you won't need to spend time training them. In this case, they will also classify the objects for which the classifier was training for.
Take a look at the Mask-RCNN implementation. I find very interesting how they explain the process. In this architecture, you will not only generate a bounding box but also segment the object of interest.

How to train your own(w/o YOLO etc.) object detector in tf/keras

I successfully trained multi-classificator model, that was really easy with simple class related folder structure and keras.preprocessing.image.ImageDataGenerator with flow_from_directory (no one-hot encoding by hand btw!) after i just compile fit and evaluate - extremely well done pipeline by Keras!
BUT! when i decided to make my own (not cats, not dogs, not you_named) object detector - this is became a nightmare...
TFRecord and tf.Example are just madness! but ok, i almost get it (my dataset is small, i have plenty of ram, but who cares, write f. boilerplate, so much meh...)
The main thing - i just can't find any docs/tutorial how to make it with plain simple tf/keras, everyone just want to build up it on top of someone model, YOLO SSD FRCNN, even if they trying to detect completely new objects!!!
There two links about OD in official docs, and they both using some models underneath.
So my main question WHY ??? or i just blind..? -__-
It becomes a nightmare because Object Detection is way way harder than classification. The most simple object detector is this: first train a classifier on all your objects. Then when you want to detect objects in your image, slide a window over your image, and classify each window. Then, if your classifier is certain that a certain window is one of the objects, mark it as a successful detection.
But this approach has a lot of problems, mainly it's way (like waaaay) too slow. So, researcher improved it and invented RCNNs. That had it problems, so they invented Faster-RCNN, YOLO and SSD, all to make it faster and more accurate.
You won't find any tutorials online on how to implement the sliding window technique because it's not useful anyway, and you won't find any tutorials on how to implement the more advanced stuff because, well, the networks get complicated pretty quick.
Also note that using YOLO doesn't mean you should use the same weights as in YOLO. You can always train YOLO from scratch on your own data if you want by randomly initiliazing all the weights in the network layers. So the even if they trying to detect completely new objects!!! you mentioned isn't really valid. Also also note that I still would advise you to do use the weights they used in Yolo network. Transfer Learning is generally looked at as being a good idea, especially when starting out and especially in the image processing world, as many images share common features (like edges, for example).
I am having pretty much the same problem as my images are B/W diagrams, quite different from regular pictures, I want to train a custom model on just only diagrams.
I have found this documentation section in Tensorflow models repo:
https://github.com/tensorflow/models/blob/master/research/object_detection/README.md
It has a couple of sections explaining how to bring your own model and dataset in "extras" that could be a starting point.

YOLO v3 complete architecture

I am attempting to implement YOLO v3 in Tensorflow-Keras from scratch, with the aim of training my own model on a custom dataset. By that, I mean without using pretrained weights. I have gone through all three papers for YOLOv1, YOLOv2(YOLO9000) and YOLOv3, and find that although Darknet53 is used as a feature extractor for YOLOv3, I am unable to point out the complete architecture which extends after that - the "detection" layers talked about here. After a lot of reading on blog posts from Medium, kdnuggets and other similar sites, I ended up with a few significant questions:
Have I have missed the complete architecture of the detection layers (that extend after Darknet53 used for feature extraction) in YOLOv3 paper somewhere?
The author seems to use different image sizes at different stages of training. Does the network automatically do this upscaling/downscaling of images?
For preprocessing the images, is it really just enough to resize them and then normalize it (dividing by 255)?
Please be kind enough to point me in the right direction. I appreciate the help!

Object detection project (root architecture) using Tensorflow + Keras. Image sample size for accurate training of model?

Im currenty working on a project at University, where we are using python + tensorflow and keras to train an image object detector, to detect different parts of the root system of Arabidopsis.
Our current ressults are pretty bad, as we do only have about 100 images to train the model with at this moment, but we are currently working on cultuvating more plants in order to get more images(more data) to train the tensorflow model.
We have implemented the following Mask_RCNN model:Github- Mask_RCNN tensorflow
We are looking to detect three object clases: stem, main root and secondary root.
But the model detects main roots incorrectly where the secondary roots are located.
It should be able to detect something like this:Root detection example
Training root data set that we are using right now:training images
What is the usual sample size that is used to train a neural network accurate results?
First off: I think there is no simple rule to estimate the sample size but at least it depends on:
1. Quality of your images
I downloaded the images and I think you need to preprocess them before you can use it to reduce the "problem complexity". In some projects, in which I worked with biological data, a background removal (image - low pass filter) was the key to get better results. But you should definitely remove/crop the area outside the region of your interest (like the tape and the ruler). I would try to get the cleanest data set as possible (including manually adjustments cv2/ gimp/ etc.) to focus the network to solve "the right problem".. After that you could apply some random distortion to make it also work on fuzzy/bad/realistic images as well.
2. The way you work with your data
There are a few tricks that enables you to "expand" your dataset.
Sometimes it's very helpful to let a generator method crop random small patches from your input data. This allows you to work with more batches (on small gpus) and gives your network more "variety", (just think about the conv2d task: if you don't use random cropping your filters will slide over the same areas over and over again (at the same image)). Because of the same reason: apply random distortion, flip and rotate your images.
3. Network architecture
In your case I would prefer a U-Net architecture with a last conv2d output of 3 (your classes) feature maps, a final softmax activation and an categorical_crossentropy, this enables you to play with the depth, because sometimes you need sophisticated architectures to solve a problem (close to 100%) but in your case you just want to see a first working result. So fewer layers and a simple architecture could also help you to get things work. Maybe there are some trained network weights for a U-Net which meets your requirements (search on kaggle for example). Because it is also helpful (to reduce the data you need) to use "transfer learning" -> use the first layers of an network (weights) which is already trained. Using a semantic segmentation the first filters will become something like an edge detection for the most given problems/images.
4. Your mental model of "accurate results"
This is the hardest part.. because it evolves during your project. Eg. in the same moment your networks starts to perform well on preprocessed input images you will start to think about architecture/data changes to make it work on fuzzy images as well. This is why you should start with a feasible problem but always improve your dataset (including rare kinds of roots) and tune your network architecture step by step.

What is the reason for new function tf.nn.softmax_cross_entropy_with_logits_v2? [duplicate]

I am wondering why in Tensorflow version 1.5.0 and later, softmax_cross_entropy_with_logits_v2 defaults to backpropagating into both labels and logits. What are some applications/scenarios where you would want to backprop into labels?
I saw the github issue below asking the same question, you might want to follow it for future updates.
https://github.com/tensorflow/minigo/issues/37
I don't speak for the developers who made this decision, but I would surmise that they would do this by default because it is indeed used often, and for most application where you aren't backpropagating into the labels, the labels are a constant anyway and won't be adversely affected.
Two common uses cases for backpropagating into labels are:
Creating adversarial examples
There is a whole field of study around building adversarial examples that fool a neural network. Many of the approaches used to do so involve training a network, then holding the network fixed and backpropagating into the labels (original image) to tweak it (under some constraints usually) to produce a result that fools the network into misclassifying the image.
Visualizing the internals of a neural network.
I also recommend people watch the deepviz toolkit video on youtube, you'll learn a ton about the internal representations learned by a neural network.
https://www.youtube.com/watch?v=AgkfIQ4IGaM
If you continue digging into that and find the original paper you'll find that they also backpropagate into the labels to generate images which highly activate certain filters in the network in order to understand them.