Can I add sample weights to the training dataset? - cntk

Is it possible to add sample weights to the configuration (e.g., in the SGD block), so that each training example is assigned a weight?
In many cases a simple 1/N (where N is the total number of samples in the training dataset) is good enough, but in other cases I would like the network to put extra "emphasis" on certain examples. Can this be done in CNTK?
Thanks,
Pedro

Yes, it can be done. If your task is binary classification you can use WeightedLogistic (label, probability, weight).
Typically, weight will be an Input() that you can hook up to a stream in your data (assuming you are using CNTKTextFormatReader, you can just have a |weight stream).
It can also be done for tasks other than binary classification with a little more effort.
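For the binary case, here is a minimal sketch using the CNTK Python API (the answer above refers to the BrainScript WeightedLogistic node; in this sketch the per-sample weight is simply multiplied into an ordinary binary cross entropy, and the stream/variable names are illustrative):
import cntk as C

# each CNTKTextFormat line would carry an extra weight stream, e.g.:
#   |features 0.2 0.7 1.3 |label 1 |weight 2.5
features = C.input_variable(3)
label = C.input_variable(1)
weight = C.input_variable(1)

z = C.layers.Dense(1, activation=C.sigmoid)(features)

# per-sample weighted loss: ordinary binary cross entropy scaled by the weight
loss = C.element_times(weight, C.binary_cross_entropy(z, label))
# hook features/label/weight up to the corresponding reader streams and pass
# `loss` to the Trainer as usual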

Related

Multiple BERT binary classifications on a single graph to save on inference time

I have five classes and I want to compare four of them against one and the same class. This isn't a One vs Rest classifier, as for each output I want to score them against one base class.
The four outputs should be: base class vs classA, base class vs classB, etc.
I could do this by having multiple binary classification tasks, but that's wasting computation time if the first layers are BERT preprocessing + pretrained BERT layers, and the only differences between the four classifiers are the last few layers of BERT (finetuned ones) and the Dense layer.
So why not merge the graphs for more performance?
My inputs are four different datasets, each annotated with true/false for each class.
As I understand it, I can re-use most of the pipeline (BERT preprocessing and the first layers of BERT), as those have shared weights. I should then be able to train the last few layers of BERT and the Dense layer on top differently depending on the branch of the classifier (maybe using something like keras.switch?).
I have tried many alternatives, including multi-class and multi-label classifiers (with actual and generated, e.g. machine-annotated, labels in the case of multiple input labels) and different activation and loss functions, but none of the results were acceptable to me (none were as good as the four separate models).
Is there a solution for merging the four different models for more performance, or am I stuck with using 4x binary classifiers?
When you train a DNN for a specific task, it will (in the vast majority of cases) be better than a more general model that handles several tasks simultaneously. That said, in my experience a properly trained general model produces very similar results to the original binary ones. Anyway, here are a couple of suggestions for training strategies (assuming your training datasets for each task are completely different):
Weak supervision approach
Train your binary classifiers, and use them to label your datasets (i.e. label datasets [1, 3, 4] with the binary classifier trained on dataset 2). Then train your joint model as a multilabel task using all the newly labeled datasets (don't forget to randomize samples before feeding them to the trainer ;) ). Here you will need to experiment with whether to apply a threshold and set labels to 0/1, or to use the raw scores of the binary classifiers.
Create a custom loss function that does not penalize the model when no information is provided for a certain class (a sketch of such a loss follows below). So when you introduce a sample from (say) dataset 2, the loss is calculated only for the 2nd class.
Of course you can apply both simultaneously. For example, if you know that the binary classifiers produce polarized scores (most results are near 0 or 1), you can use weak labels and automatically label your data with the scores. Then, during the second stage, weight the loss of each sample by w = 4(x - 0.5)^2, where x is the binary classifier's score (note that you get logits from the model, so you will need to apply a sigmoid first). This way you increase the contribution of the samples the binary classifier is confident about, and reduce that of the less certain ones.
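A minimal sketch of the masked-loss idea from the second suggestion (TensorFlow/Keras; encoding "no information" as -1 and the function name are my own illustrative choices, not an established API):
import tensorflow as tf

def masked_binary_loss(y_true, y_pred_logits):
    # y_true holds 0/1 per class, with -1 meaning "no label information for
    # this class on this sample" (e.g. a sample from dataset 2 only carries
    # a label for class 2)
    mask = tf.cast(tf.not_equal(y_true, -1.0), tf.float32)   # 1 where a label exists
    y_true_clean = tf.clip_by_value(y_true, 0.0, 1.0)         # turn the -1s into dummy 0s
    per_class = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=y_true_clean, logits=y_pred_logits)
    per_class = per_class * mask                               # unlabeled classes contribute nothing
    return tf.reduce_sum(per_class, axis=-1) / tf.maximum(tf.reduce_sum(mask, axis=-1), 1.0)
If you also use weak labels, the 4(x - 0.5)^2 confidence weight can be multiplied into per_class in the same way as the mask.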
As for unfreezing the last layers of BERT, usually unfreezing the top 3-6 layers is enough. Unfreezing more layers improves results very little and increases time and memory requirements.

Can SigmoidFocalCrossEntropy in Tensorflow (tf-addons) be used in Multiclass Classification? ( What is the right way)?

The Focal Loss given in Tensorflow (tf-addons) is used for class imbalance. For binary classification there is a lot of example code available, but for multiclass classification there is very little. I ran the code with one-hot encoded target variables for 250 classes and it gave me results without any error.
import pandas as pd
import tensorflow_addons as tfa

y = pd.get_dummies(df['target'])  # one-hot encoded target classes
model.compile(
    optimizer="adam", loss=tfa.losses.SigmoidFocalCrossEntropy(), metrics=metric
)
I just want to know, from whoever wrote this code or someone with enough knowledge of it: can it be used for multiclass classification? If not, then how come it did not give me errors, and in fact gave me better results than CrossEntropy? Also, in other implementations like this one, a value of alpha has to be given for every class, but Tensorflow's implementation takes just one value.
What is the correct way to use this?
Some basics first.
Categorical Crossentropy is designed to incentivize a model to predict 100% for the correct label. It was designed for models that do single-label multi-class classification - like CIFAR10 or Imagenet. Usually these models finish in a Dense layer with more than one output.
Binary Crossentropy is designed to incentivize a model to predict 100% if the label is one, or 0% if the label is zero. Usually these models finish in a Dense layer with exactly one output.
When you apply Binary Crossentropy to a single-label multi-class classification problem, you are doing something that is mathematically valid but defines a slightly different task: you are incentivizing a single-label classification model to not only get the true label correct, but also minimize the false labels.
For example, if your target is dog and your model predicts 60% dog, CCE doesn't care whether your model predicts 20% cat and 20% French horn, or 40% cat and 0% French horn. So this is aligned with a top-1 accuracy concept.
But if you take that same model and apply BCE, and your model predicts 60% dog, BCE DOES care whether your model predicts 20%/20% cat/French horn vs 40%/0% cat/French horn. To put it in precise terminology, the former is more "calibrated" and so it has some additional measure of goodness. However, this has little correlation to top-1 accuracy.
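A tiny worked example of that dog/cat/French-horn case (plain NumPy, just to illustrate the point, not any library's implementation):
import numpy as np

target = np.array([1.0, 0.0, 0.0])   # dog, cat, French horn
pred_a = np.array([0.6, 0.2, 0.2])
pred_b = np.array([0.6, 0.4, 0.0])

def cce(y, p):
    # categorical cross entropy only looks at the true-class probability
    return -np.sum(y * np.log(np.clip(p, 1e-7, 1.0)))

def bce(y, p):
    # binary cross entropy also scores how the remaining mass is spread
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

print(cce(target, pred_a), cce(target, pred_b))   # identical (~0.51 each)
print(bce(target, pred_a), bce(target, pred_b))   # differ (~0.32 vs ~0.34)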
When you use BCE, presumably you are spending some of the model's capacity on calibration at the expense of top-1 accuracy. But as you might have seen, it doesn't always work out that way. Sometimes BCE gives you superior results. I don't know that there's a clear explanation for that, but I'd assume the additional signals (in the case of Imagenet, you literally get 1000 times more signals) somehow create a smoother loss that perhaps helps smooth the gradients you receive.
The focusing term of focal loss (the gamma exponent) additionally penalizes very wrong predictions and lessens the penalty when your model predicts something close to the right answer - like predicting 90% cat if the ground truth is cat - while alpha re-weights the contribution of each class. This is a shift from the original definition of CCE, based on the theory of Maximum Likelihood Estimation... which focuses on calibration... vs the normal metric most ML practitioners care about: top-1 accuracy.
Focal loss was originally designed for binary classification so the original formulation only has a single alpha value. The repo you pointed to extends the concept of Focal Loss to single-label classification and therefore there are multiple alpha values: one per class. However, by my read, it loses the additional possible smoothing effect of BCE.
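For reference, a minimal sketch of a multi-class focal loss with one alpha per class (my own illustrative implementation, not the repo's exact code):
import tensorflow as tf

def multiclass_focal_loss(alpha, gamma=2.0):
    # alpha: one weight per class; gamma: focusing exponent from the paper
    alpha = tf.constant(alpha, dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_true is one-hot, y_pred is assumed to be softmax probabilities
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        ce = -y_true * tf.math.log(y_pred)                  # per-class cross entropy
        focal = alpha * tf.pow(1.0 - y_pred, gamma) * ce    # down-weight easy, confident classes
        return tf.reduce_sum(focal, axis=-1)

    return loss

# e.g. model.compile(optimizer="adam",
#                    loss=multiclass_focal_loss(alpha=[0.25] * 250), metrics=metric)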
Net net, for the best results you'll want to benchmark CCE, BCE, binary focal loss (from TFA, per the original paper), and the single-label multi-class focal loss that you found in that repo. In general, the discovery of those alpha values is done via guess & check, or grid search.
There's a lot of manual guessing and checking in ML unfortunately.

Once a CNN is trained, should its outputs be deterministic?

I just trained a CNN with Tensorflow/Keras and saved it as a model. I tried running about 1000 inputs through it multiple times, and each time got a slightly different prediction accuracy. The accuracy was good, and I am not concerned with the performance; however, I thought that CNN models, once trained, should be deterministic. That is, any input will always be classified the same way. Is this not the case? Is there variability in the way a model can predict once trained? If not, hopefully I can assume that I have programmed some variability into my code unawares. Any help would be appreciated.
Once a CNN is trained, should its outputs be deterministic?
Well, in theory, yes. In practice, as Peter Duniho points out in his excellent explanatory comment, we can see very small deviations because of the way values are calculated, aggregated, etc.
In practice, the probability of such small deviations changing the predicted category (and therefore the accuracy) of a classification model is so small that I'd be almost certain something else is at play in your example, even over a sample size of 1000.
Have you left on some training regularisation like batch normalisation? Are you certain you are evaluating precisely the same 1000 inputs each time? Got to suspect the issue is in the code rather than rounding errors.
Can you determine which specific classification changes?
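One quick way to check (a sketch; model and x_eval stand in for your saved model and the same 1000 inputs):
import numpy as np

# run the same inputs through the saved model twice and see whether any
# predicted class actually changes between runs
preds_a = model.predict(x_eval, verbose=0)
preds_b = model.predict(x_eval, verbose=0)

labels_a = np.argmax(preds_a, axis=-1)
labels_b = np.argmax(preds_b, axis=-1)

changed = np.flatnonzero(labels_a != labels_b)
print(len(changed), "of", len(labels_a), "predictions changed class:", changed[:10])
print("max absolute probability difference:", np.abs(preds_a - preds_b).max())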

Batch structure for training a ranking model with contrastive loss?

How do I choose my batch if I train a deep ranking model with, e.g., a contrastive loss, where each query has 1 positive document and 2 negative samples?
So, it is about a ranking loss that applies to, e.g., the Quora question-pair data or any other question/answer pairs that I want to rank using a deep learning ranking model or just a Siamese network.
The data would look like this: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv
Now, I assume that it is crucial how the batch is built, right? Since for every question, all of its corresponding positive and negative answers need to be contained inside a batch, right?
Different strategies can be used to build the batches and the triplets or pairs. Usually the batches are built randomly, and then the hardest negative, or one of the hardest negatives in the batch, is picked.
So yes, positive and negative examples need to be contained inside a batch, and picking the negatives is crucial. But usually the effort goes into picking the proper negatives inside the batch, rather than into building the batches in a specific way.
This blogpost explaining how ranking losses work may be useful: https://gombru.github.io/2019/04/03/ranking_loss/
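As a rough sketch of in-batch hard-negative picking (plain NumPy; it assumes L2-normalized query and document embeddings for one randomly sampled batch, with positive_idx giving each query's own positive document):
import numpy as np

def hardest_negatives(query_emb, doc_emb, positive_idx):
    scores = query_emb @ doc_emb.T                                  # (num_queries, num_docs) similarities
    scores[np.arange(len(positive_idx)), positive_idx] = -np.inf    # mask out each query's own positive
    return np.argmax(scores, axis=1)                                # index of the hardest negative per query
Those indices then go into the contrastive/triplet loss together with the positives.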

Using unlabeled dataset in Keras

Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have 100,000 rows of patients with 12 fields per row, the last field will indicate whether the patient is diabetic or not (0 or 1).
Then, after training is finished, I can insert a new record and predict whether this person is diabetic or not.
But in the case of unlabeled datasets, where I cannot label the data for some reason, how can I train the neural network so that it knows these are the normal records, and any new record that does not match them will be flagged as malicious or rejected?
This is called one-class learning and is usually done using autoencoders. You train an autoencoder on the training data to reconstruct the data itself; the labels in this case are the inputs themselves. This gives you a reconstruction error. https://en.wikipedia.org/wiki/Autoencoder
Now you can define a threshold on the reconstruction error to decide whether the data is benign or not. The hope is that the good data is reconstructed better than the bad data.
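A minimal sketch of that setup in Keras (the layer sizes, the 99th-percentile threshold, and x_normal, an array of normal records with 12 fields, are illustrative assumptions):
import numpy as np
from tensorflow import keras

n_features = 12
autoencoder = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(n_features,)),
    keras.layers.Dense(4, activation="relu"),          # bottleneck
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(n_features, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(x_normal, x_normal, epochs=50, batch_size=64)   # input is also the target

# reconstruction error on the normal data defines the threshold
recon = autoencoder.predict(x_normal)
errors = np.mean((x_normal - recon) ** 2, axis=1)
threshold = np.percentile(errors, 99)                  # e.g. flag the worst 1% as suspicious

def is_suspicious(record):
    rec = autoencoder.predict(record[None, :])[0]
    return np.mean((record - rec) ** 2) > threshold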
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy is going to be. But as a rough estimate, supervised learning will perform better on the trained data, because more information is supplied to the algorithm. However, if the actual data is quite different from the training data, the network will underperform in practice, while the autoencoder tends to deal better with unfamiliar data. Additionally, as a rule of thumb you should have about 5000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some data for testing anyway.
It sounds like you need to fit two different models:
a model for bad record detection
a model for prediction of a patient's likelihood to be diabetic
For both of these models you will need labels. For the first model, the labels would indicate whether the record is good or bad (malicious); for the second, whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.
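For the bad-record detector, a quick scikit-learn sketch (X and y_bad are placeholders for your 12 record fields and good/bad labels):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: record fields, y_bad: 1 for bad/malicious records, 0 for good ones
X_train, X_test, y_train, y_test = train_test_split(X, y_bad, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("bad-record detection accuracy:", clf.score(X_test, y_test))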