I have been trying to train a breast cancer segmentation model with Mask R-CNN. I have been able to understand almost all the hyperparameters, but this one variable, TRAIN_ROI_PER_IMAGE, I just can't seem to wrap my head around, and there's little to no documentation available for it.
If anyone could please explain it to me, it would be super helpful for my research.
TRAIN_ROI_PER_IMAGE means how many Region of Interest (ROI) proposals will be fed to the classifier/mask heads per image.
(image source: Ren et al., 2016)
Concretely, this setting acts like the batch size for the second stage of the model.
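For context, here is a minimal sketch of where you would set this, assuming the Matterport Mask R-CNN implementation (where the setting is spelled TRAIN_ROIS_PER_IMAGE); the class name and values below are illustrative only:

from mrcnn.config import Config

# Hypothetical config for a binary (background + tumor) segmentation task.
class TumorConfig(Config):
    NAME = "tumor"
    NUM_CLASSES = 1 + 1  # background + tumor
    # Number of sampled ROIs fed to the classifier/mask heads per image,
    # i.e. the effective batch size of the second stage.
    TRAIN_ROIS_PER_IMAGE = 200

Per the comments in the Matterport config, the Mask R-CNN paper uses 512 here, but the RPN often doesn't generate enough positive proposals to fill that while keeping a 1:3 positive:negative ratio, which is why a smaller value is commonly used.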
Focal loss, as provided in TensorFlow (Addons), is used for class imbalance. For binary classification there are lots of code examples available, but for multiclass classification there is very little help. I ran the code with one-hot encoded target variables for 250 classes and it gave me results without any error.
import pandas as pd
import tensorflow_addons as tfa

y = pd.get_dummies(df['target'])  # one-hot encoded target classes

model.compile(
    optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics=metric
)
I just want to know, from whoever wrote this code or someone with enough knowledge of it: can it be used for multiclass classification? If not, how come it did not give me errors, and in fact better results than cross-entropy? Also, in other implementations like this one, a value of alpha has to be given for every class, but there is just one value in TensorFlow's implementation.
What is the correct way to use this?
Some basics first.
Categorical cross-entropy is designed to incentivize a model to predict 100% for the correct label. It was designed for single-label multi-class classification problems - like CIFAR10 or ImageNet. Usually these models finish in a Dense layer with more than one output.
Binary cross-entropy is designed to incentivize a model to predict 100% if the label is one, or 0% if the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply binary cross-entropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also to minimize the false labels.
For example, if your target is dog and your model predicts 60% dog, CCE doesn't care whether your model predicts 20% cat and 20% French horn, or 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predicts 60% dog, BCE DOES care whether your model predicts 20%/20% cat/French horn vs 40%/0% cat/French horn. To put it in precise terminology, the former is more "calibrated", so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
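Here is a quick numeric sanity check of that claim, a sketch using the standard Keras losses and the probabilities from the example above:

import tensorflow as tf

y_true = tf.constant([[1.0, 0.0, 0.0]])   # [dog, cat, French horn]
p_a = tf.constant([[0.6, 0.2, 0.2]])      # 20%/20% split of the wrong mass
p_b = tf.constant([[0.6, 0.4, 0.0]])      # 40%/0% split of the wrong mass

cce = tf.keras.losses.CategoricalCrossentropy()
bce = tf.keras.losses.BinaryCrossentropy()

# CCE only sees the 60% on the true class, so both give the same ~0.511.
print(cce(y_true, p_a).numpy(), cce(y_true, p_b).numpy())
# BCE also penalizes how the remaining 40% is distributed, so these differ.
print(bce(y_true, p_a).numpy(), bce(y_true, p_b).numpy())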
When you use BCE, presumably you are spending some of the model's capacity on calibration at the expense of top-1 accuracy. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation for that, but I'd assume the additional signals (in the case of ImageNet, you'll literally get 1000 times more signals) somehow create a smoother loss value that perhaps helps smooth the gradients you receive.
The gamma value of focal loss lessens the penalty when your model predicts something close to the right answer - like predicting 90% cat when the ground truth is cat - so training focuses on the examples it still gets badly wrong; alpha, by contrast, simply re-weights classes. This is a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the normal metric most ML practitioners care about: top-1 accuracy.
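For reference, the focal loss from the original paper (Lin et al., 2017), where p_t is the predicted probability of the true class, alpha_t is the class weight and gamma is the focusing parameter:

\mathrm{FL}(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \, \log(p_t)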
Focal loss was originally designed for binary classification, so the original formulation only has a single alpha value. The repo you pointed to extends the concept of focal loss to single-label multi-class classification, and therefore there are multiple alpha values: one per class. However, by my reading, it loses the additional possible smoothing effect of BCE.
Net net, for the best results, you'll want to benchmark CCE, BCE, binary focal loss (out of TFA and per the original paper), and the single-label multi-class focal loss that you found in that repo. In general, the discovery of those alpha values is done via guess & check, or grid search; a sketch of such a benchmark is below.
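A minimal benchmarking sketch, assuming a hypothetical build_model() that returns an uncompiled Keras model, plus pre-split x_train/y_train/x_val/y_val arrays with one-hot labels:

import tensorflow as tf
import tensorflow_addons as tfa

candidate_losses = {
    "cce": tf.keras.losses.CategoricalCrossentropy(),
    "bce": tf.keras.losses.BinaryCrossentropy(),
    "binary_focal": tfa.losses.SigmoidFocalCrossEntropy(alpha=0.25, gamma=2.0),
}

results = {}
for name, loss in candidate_losses.items():
    model = build_model()  # assumed: returns a fresh, uncompiled Keras model
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                        epochs=10, verbose=0)
    results[name] = max(history.history["val_accuracy"])

print(results)  # compare the best validation accuracy under each loss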
There's a lot of manual guessing and checking in ML unfortunately.
I am using YOLOv4-tiny for a custom dataset of 26 classes that I collected from Open Images Dataset. The dataset is almost balanced (850 images per class, but a different number of bounding boxes). When I used YOLOv4-tiny to train on just 3 classes, the loss was near 0.5 and it was fairly accurate. But for 26 classes, as soon as the loss goes below 2 the model starts to overfit. The predictions are also very inaccurate.
I have tried changing parameters like the learning rate, the momentum and the size, but whatever I do the model becomes worse than before. Using the regular YOLOv4 model rather than YOLOv4-tiny does not help either. How can I bring the loss further down?
Have you tried training with mAP? You can take a subset of your training set and make it the validation set, in the same way you made your training and test sets. Then you can run darknet.exe detector train data/obj.data yolo-obj.cfg yolov4.conv.137 -map. This will keep track of the loss on your validation set. When the error on the validation set goes up, that is the time to stop training and prevent overfitting (this is called early stopping).
You need to run the training for (classes * 2000) iterations. However, for the best scores, you need to train your model for at least 6000 iterations (this limit is known as max_batches). Also, please remember that if you are using black & white images, you should change channels=3 to channels=1. You can stop your training once the avg loss becomes something like 0.XXXX.
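For concreteness, here is a small helper (hypothetical, but following the numbers from the AlexeyAB darknet training guide) that computes the usual cfg values for a given class count:

# max_batches = classes * 2000 (but at least 6000), steps = 80% and 90% of
# max_batches, filters = (classes + 5) * 3 for the convolutional layers
# directly before each [yolo] layer.
def yolo_cfg_values(num_classes):
    max_batches = max(num_classes * 2000, 6000)
    steps = (int(max_batches * 0.8), int(max_batches * 0.9))
    filters = (num_classes + 5) * 3
    return max_batches, steps, filters

print(yolo_cfg_values(26))  # (52000, (41600, 46800), 93)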
Here's my mAP graph for 6000 iterations that ran for 6.2 hours:
(chart: avg loss with 6000 max_batches)
Moreover, you can follow this FAQ documentation here by Stéphane Charette.
I'm fairly new to this topic as a whole and struggle to wrap my head around even the basics of neural networks in general. I'm not looking for a project plan; I appreciate that you probably have better things to do.
Nonetheless, any idea or push in the right direction is appreciated.
Imagine a grey-box model of some kind (thermal network, electrical network, and so on): it's desirable to predict outputs based on very few features, with an underlying smart model that is trained on a much bigger dataset.
My question would be: is it possible to train a model with features, defining some as mandatory and some as good-to-have for the predictions?
Any tips are appreciated.
Cheers
Yes, you can train your model like that, but you must feed all the features during prediction. For example, say you have 30 mandatory features and 10 optional features, 40 in total. You must feed all 40 features to get a prediction from your model; the input data shape must always be the same. But we asked for optional features, so why am I being forced to supply them all now? Well, I will discuss two options.
Option 1: set the input shape to None. If you set the input shape to None, your model will accept input of any shape, but you will have to handle some things. You can't use a MaxPooling layer naively. If you really need to use MaxPool, you will need to calculate the input and output shapes of all the layers using only the mandatory feature shape (the minimum input shape). If you calculate using (mandatory + optional features), you will end up with an error, because the input to a MaxPool layer can become too small and can't be reduced. Take care of that and you're good to go.
Option 2: I will give you an example. I was using an OpenPose output dataset to classify some movements. The OpenPose output is 18 bone keypoints, i.e. 36 features including x and y coordinates. These keypoints were being extracted from live camera frames, but we can't assume all the body parts of a human will always be inside the frame. When someone's legs are outside the frame, we can't get their leg keypoints, but we still need to classify. There were a lot of options: we could replace the missing keypoints with 0, or find the median/mean of such poses and use that value for the missing keypoints. We found the best approach by analyzing all the data. If you're going with option 2, I suggest you analyze the data first, then decide how you're going to handle the missing fields. A sketch of this idea is below.
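A minimal sketch of option 2, assuming the 30 mandatory + 10 optional features from above and per-feature means precomputed from the training data (all names here are hypothetical):

import numpy as np

N_MANDATORY, N_OPTIONAL = 30, 10

# Per-feature means, assumed precomputed from the training data; missing
# optional features are imputed with these so the input shape stays (40,).
feature_means = np.zeros(N_MANDATORY + N_OPTIONAL)

def build_input(mandatory, optional=None):
    x = feature_means.copy()
    x[:N_MANDATORY] = mandatory
    if optional is not None:
        x[N_MANDATORY:] = optional
    return x[np.newaxis, :]  # shape (1, 40), ready for model.predict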
[I'm a noob in Machine Learning and OpenCV]
Below are the results, i.e. the 68 facial landmarks that you get by applying DLib's facial landmarks model, which can be found here.
This script mentions that the model was trained on the iBUG 300-W face landmark dataset.
Now, I wish to create a similar model for mapping the hand's landmarks. I have the hand dataset here.
What I don't get is:
1. How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image, or is there an optimised way to do this?
2. In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow's points are 22, 23, 24, 25 and 26. At what point would they have been given those values?
3. Would training those images on DLib's shape predictor training script suffice or would I have to train the model on other frameworks (like Tensorflow + Keras) too?
How am I supposed to train the model on those positions? Would I have to manually mark each joint in every single image, or is there an optimised way to do this?
-> Yes, you should do it all manually: detecting the hand location and defining how many points you need to describe the shape.
In DLib's model, each facial landmark position has a particular value, e.g. the right eyebrow's points are 22, 23, 24, 25 and 26. At what point would they have been given those values?
-> They are assigned during the training step. For example, if you want 3 points for each finger and 2 more points for the wrist, that's 15 + 2 = 17 points in total.
It depends on how you define which points belong to which finger, e.g. point[0] to point[2] belong to the thumb, and so on.
Would training those images on DLib's shape predictor training script suffice or would I have to train the model on other frameworks (like Tensorflow + Keras) too?
-> With dlib you can do everything.
@thachnb's answer above covers everything:
Labelling has to be done manually. With DLib, imglab comes with the source and can easily be built with CMake. It allows for:
Labelling with boxes
Annotating/denoting parts (features) of an object (the object of interest), e.g. the face is the object and the landmark positions are its parts.
I was recommended Amazon Mechanical Turk a lot during my research.
The features are given those values during labelling. You can be creative with the naming conventions, as long as you are consistent.
As only landmark/feature-point training is required here, DLib's pre-built shape predictor trainer would suffice. DLib has pretty in-depth documentation, so it won't be hard to follow along; a training sketch is below.
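For illustration, a minimal sketch of training with dlib's Python API, assuming you have produced an imglab-style XML file (hands_train.xml is a hypothetical name) listing images, boxes and part annotations:

import dlib

options = dlib.shape_predictor_training_options()
options.tree_depth = 4            # size/accuracy trade-off
options.nu = 0.1                  # regularization strength
options.oversampling_amount = 20  # training-time augmentation
options.be_verbose = True

# Trains on the annotated images listed in the XML and writes the model file.
dlib.train_shape_predictor("hands_train.xml", "hand_predictor.dat", options)

predictor = dlib.shape_predictor("hand_predictor.dat")  # load for inference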
Additionally, you might need some good hand dataset resources for this. Below are some great suggestions:
CVOnline: Image Databases
TwentyBN Dataset
Silesian University of Technology's Database
Hope this works out for others.
I'm trying to use TensorFlow to train output servo commands given an input image.
I plan on using a file as @mrry suggested in this question, with the images listed like so:
../some/path/some_img.JPG *some_label*
My question is, what are the label formats I can provide to TensorFlow and what structures are suggested?
My data is basically n servo commands from 0-10 seconds. A vector would work great:
[0,2,4,3]
or similarly:
[0,.25,.4,.3]
I couldn't find much about labels in the docs. Can anyone shed any light on TensorFlow labels?
And a very related question is what is the best way to structure these for TensorFlow to properly learn from them?
In TensorFlow, labels are just generic tensors. You can use any kind of tensor to store your labels. In your case, a 1-D tensor with shape (4,) seems to be what you want.
Labels only differ from the rest of the data by their use in the computational graph. (Usually) labels should only be used inside the loss function, while you propagate the other data through the whole network. For your problem, a 4-dimensional regression output should work.
Also, look at my newest comment on the (old) question. Using slice_input_producer seems to be preferable in your case.
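To make the shape concrete, here is a minimal sketch of such a 4-output regression model, written with modern tf.keras rather than the TF1 queue-based pipelines discussed above; the image size is a made-up placeholder:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64, 64, 3)),       # hypothetical image size
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(4),                       # one output per servo command
])
model.compile(optimizer="adam", loss="mse")

# Labels: shape (num_examples, 4), e.g. [0, 0.25, 0.4, 0.3] for one image.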