Keras - class_weight vs sample_weights in the fit_generator

Keras - class_weight vs sample_weights in the fit_generator - tensorflow

In Keras (using TensorFlow as a backend) I am building a model which is working with a huge dataset that is having highly imbalanced classes (labels). To be able to run the training process, I created a generator which feeds chunks of data to the fit_generator.
According to the documentation for the fit_generator, the output of the generator can either be the tuple (inputs, targets) or the tuple (inputs, targets, sample_weights). Having that in mind, here are a few questions:
My understanding is that
the class_weight regards the weights of all classes for the entire dataset whereas
the sample_weights regards the weights of all classes for each individual chunk
created by the generator. Is that correct? If not, can someone elaborate on the matter?
Is it necessary to give both the class_weight to the fit_generator and then the sample_weights as an output for each chunk? If yes, then why? If not then which one is better to give?
If I should give the sample_weights for each chunk, how do I map the weights if some of the classes are missing from a specific chunk? Let me give an example. In my overall dataset, I have 7 possible classes (labels). Because these classes are highly imbalanced, when I create smaller chunks of data as an output from the fit_generator, some of the classes are missing from the specific chunk. How should I create the sample_weights for these chunks?

My understanding is that the class_weight regards the weights of all
classes for the entire dataset whereas the sample_weights regards the
weights of all classes for each individual chunk created by the
generator. Is that correct? If not, can someone elaborate on the
matter?
class_weight affects the relative weight of each class in the calculation of the objective function. sample_weights, as the name suggests, allows further control of the relative weight of samples that belong to the same class.
Is it necessary to give both the class_weight to the fit_generator and
then the sample_weights as an output for each chunk? If yes, then why?
If not then which one is better to give?
It depends on your application. Class weights are useful when training on highly skewed data sets; for example, a classifier to detect fraudulent transactions. Sample weights are useful when you don't have equal confidence in the samples in your batch. A common example is performing regression on measurements with variable uncertainty.
If I should give the sample_weights for each chunk, how do I map the
weights if some of the classes are missing from a specific chunk? Let
me give an example. In my overall dataset, I have 7 possible classes
(labels). Because these classes are highly imbalanced, when I create
smaller chunks of data as an output from the fit_generator, some of
the classes are missing from the specific chunk. How should I create
the sample_weights for these chunks?
This is not an issue. sample_weights is defined on a per-sample basis and is independent from the class. For this reason, the documentation states that (inputs, targets, sample_weights) should be the same length.
The function _weighted_masked_objective in engine/training.py has an example of sample_weights are being applied.

Related

Is there linter for model(inputs) of PyTorch like model.predict(inputs) of TensorFlow?

My goal is to do object detection. However, YOLOv7 and (hack to create bounding box with feature map) tutorial is using PyTorch.
The problem is: model(inputs) do not have typings.
The code L148-L150
out = model(inputs)
probs, class_preds = torch.max(out[0], dim=-1)
feature_maps = out[1].to("cpu")
The forced me to debug the helper.py file to understand what [0] and out[1] are. Currently, I assume that out[0] as the softmax probability and out[1] as the feature maps.

I think the answer is no, in general it is non-trivial to automatically infer the semantic meaning of the outputs of a neural network; this is a product of the semantic meaning of the inputs and the model structure itself. You could reference the Yolo model architecture provided in model.py (though as an aside you should not link to external code but rather provide relevant code in your question itself) and investigate the structure of the outputs, then reference the structure of the labeled inputs (as the model by definition is learning to replicate the structure of the labels.)
That being said in your case the output is quite obviously per-class probabilities and class indexes as shown in line 149:
probs, class_preds = torch.max(out[0], dim=-1)
as the outputs from torch.max per pytorch documentation are (maximum value, maximum index).

Keras - sample_weights vs. sample_weight in model.fit?

I am using Keras with a tensorflow backend to train some CNNs for semantic segmentation of biomedical images. I am trying to weight every pixel in my input images during training and believe I am doing so with the data generator I am passing to model.fit.
However, I am a little confused about the meaning of 'sample_weights' vs. 'sample_weight' in the documentation for model.fit.
'sample_weights' is the third optional output from your dataset or image generator - i.e. the output of the generator can either be the tuple (inputs, targets) or the tuple (inputs, targets, sample_weights). I believe this lets me create a mask that weights my samples pixel-by-pixels, but this isn't super clear from the documentation.
'sample_weight' is a separate field that seems to be pretty clearly defined as a weight you can give to every sample. If I understand, this would allow me to give more or less weight to particular images in my training set.
Do I have this right? Thanks.

variational autoencoder with limited data

Im working on a binary classificaton project, and im using VAE (variational autoencoder) to handle the imbalance between the 2 classes by generating new samples for the minority class.
the first class (majority class) contains 20000 samples, and the second one (minority class) contains 500 samples.
After training VAE model on the minority class, i generated new samples for this class and add them to the training set, then i trained two classification models, a model on trained on the imbalanced data (only training set) and the second one trained with training set + data generated by VAE). The problem is the first model is giving results better than the second(f1-score, Roc auc...), and i thought that maybe the problem was because of the limited amount of data that the VAE was trained on.
Any help please.

Though 500 training Images are not good enough to generate diversified images from a VAE, you can still try producing some. It's better to take mean of latents of 10 different images (or even more) and pass it through the decoder ( if you're already doing this, ignore it. If you're doing some other method, try this).
If it's still not working, then, I suggest you to build a Conditional VAE on your entire dataset. In conditional VAE, you train VAE using the labels so that your models learns not only reconstruction but also what class of image it is reconstructing. This helps you to generate an Image of any particular class.

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternative options including multi-class and multi-label classifiers, with actual and generated (eg, machine-annotated) labels in the case of multiple input labels, different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?

When you train DNN for specific task it will be (in vast majority of cases) be better than the more general model that can handle several task simultaneously. Saying that, based on my experience the properly trained general model produces very similar results to the original binary ones. Anyways, here couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers, and label your datasets using them (i.e. label with binary classifier trained on dataset 2 datasets [1,3,4]). Then train your joint model as multilabel task using all the newly labeled datasets (don't forget to randomize samples before feeding them to trainer ;) ). Here you will need to experiment if you will use threshold and set a label to 0/1 or use the scores of the binary classifiers.
Create custom loss function that will not penalize if no information provided for certain class. So when your will introduce sample from (say) dataset 2, your loss will be calculated only for the 2nd class.
Of course you can apply both simultaneously. For example, if you know that binary classifier produces scores that are polarized (most results are near 0 or 1), you can use weak labels, and automatically label your data with scores. Now during the second stage penalize loss such that for score x' = 4(x-0.5)^2 (note that you get logits from the model, so you will need to apply sigmoid function). This way you will increase contribution of the samples binary classifier is confident about, and reduce that of less certain ones.
As for releasing last layers of BERT, usually unfreezing upper 3-6 layers is enough. Releasing more layers improves results very little and increases time and memory requirements.

For what are responsible weights?

I'm reading the google ML crash course and have one question.
What is a weight? (I understand that this is a slope in a plot, but it doesn't fit into my understanding)
I also don't understand an impact of weight on the model prediction (for example, in this playground)
Many thanks for the help.

Every layer in a model is a huge mathematical function with many "unknown" variables.
When you build a model, you build a monster function (with thousands or millions of unknown variables) that gives an output from an input.
Something like this:
output_tensor = huge_function(your_input_tensor,var1,var2,var3,var4.......,var10000000)
These variables are the weights. At the beginning, they receive random values, and obviously your function gives you terrible results.
As you train, you adjust the values of these variables so that your results improve.
Weights are such variables, the ones in the model that you are going to adjust so that your huge function brings you good results.
Weights x Biases
Depending on what you are reading, or what program you're using, they will be called weights. According to what I wrote above, both fit the description.
But usually:
Weights - Multiply the inputs
Biases - Are added to the multiplied outputs
So, the usual layers (with some important differences, of course), perform operations like:
output_matrix = input_matrix x weights + biases
Nothing prevents you from creating custom operations, though, where your variables/weights neither multiply nor add.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas