Tensorflow/Keras: Cost function that penalizes specific errors/confusions

I have a classification scenario with more than 10 classes where one class is a dedicated "garbage" class. With a CNN I currently reach accuracies around 96%, which is good enough for me.
In this particular application false positives (recognizing "garbage" as any non-garbage class) are a lot worse than confusions between the non-garbage classes or false negatives (recognizing any non-garbage class instead of "garbage"). To reduce these false positives I am looking for a suitable loss function.
My first idea was to use the categorical crossentropy and add a penalty value whenever a false positive is detected: (pseudocode)
loss = categorical_crossentropy(y_true, y_pred) + weight * penalty
penalty = 1 if (y_true == "garbage" and y_pred != "garbage") else 0
My Keras implementation is:
from keras import backend as K

def penalized_cross_entropy(y_true, y_pred, garbage_id=0, weight=1.0):
    ref_is_garbage = K.equal(K.argmax(y_true), garbage_id)
    hyp_not_garbage = K.not_equal(K.argmax(y_pred), garbage_id)
    penalty_ind = K.all(K.stack([ref_is_garbage, hyp_not_garbage], axis=0), axis=0)  # logical and
    penalty = K.cast(penalty_ind, dtype='float32')
    return K.categorical_crossentropy(y_true, y_pred) + weight * penalty
I tried different values for weight but I was not able to reduce the false positives. For small values the penalty has no effect at all (as expected), and for very large values (e.g. weight = 50) the network only ever recognizes a single class.
Is my approach complete nonsense, or should it work in theory? (It's my first time working with a non-standard loss function.)
Are there other/better ways to penalize such false positive errors? Sadly, most articles focus on binary classification and I could not find much for the multiclass case.
Edit:
As stated in the comments, the penalty above is not differentiable and therefore has no effect on the training updates. This was my next attempt:
def penalized_cross_entropy(y_true, y_pred, garbage_id=0, weight=1.0):
    ngs = 1 - y_pred[:, garbage_id]  # non-garbage score (sum of scores of all non-garbage classes)
    penalty = y_true[:, garbage_id] * ngs / (1. - ngs)
    return K.categorical_crossentropy(y_true, y_pred) + weight * penalty
Here the combined score of all non-garbage classes, divided by the garbage score, is added for every sample of the minibatch that is a false positive. For samples that are not false positives, the penalty is 0.
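As a quick sanity check (a minimal sketch with made-up scores, using plain NumPy), the penalty term grows sharply as the predicted garbage score drops, which is what pushes true-garbage samples back toward the garbage class:
import numpy as np

# hypothetical garbage scores for three true-garbage samples
p_garbage = np.array([0.9, 0.5, 0.1])
ngs = 1 - p_garbage          # combined non-garbage score
penalty = ngs / (1 - ngs)    # same ratio as in the loss above
print(penalty)               # [0.111... 1. 9.]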
I tested the implementation on mnist with a small feedforward network and sgd optimizer using class "5" as "garbage":
With just the crossentropy, the accuracy is around 0.9343 and the "false positive rate" (class "5" images recognized as something else) is 0.0093.
With the penalized cross entropy (weight 3.0), the accuracy is 0.9378 and the false positive rate is 0.0016.
So apparently this works; however, I am not sure if it's the best approach. Also, the Adam optimizer does not work well with this loss function, which is why I had to use SGD.
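For reference, one way to plug a loss with extra arguments into Keras (a sketch; the model object is hypothetical) is to bind them with functools.partial so the loss still has the (y_true, y_pred) signature Keras expects:
from functools import partial

model.compile(optimizer='sgd',
              loss=partial(penalized_cross_entropy, garbage_id=0, weight=3.0),
              metrics=['accuracy'])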


Neural Network Input scaling

I trained a simple fully connected network on the CIFAR-10 dataset:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.transforms as transforms

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(3*32*32, 300, bias=False)
        self.fc2 = nn.Linear(300, 10, bias=False)

    def forward(self, x):
        x = x.reshape(250, -1)  # hard-coded batch size of 250
        self.x2 = F.relu(self.fc1(x))
        x = self.fc2(self.x2)
        return x

def train():
    # The output of torchvision datasets are PILImage images of range [0, 1].
    transform = transforms.Compose([transforms.ToTensor()])
    trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
    trainloader = torch.utils.data.DataLoader(trainset, batch_size=250, shuffle=True, num_workers=4)
    testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
    testloader = torch.utils.data.DataLoader(testset, batch_size=250, shuffle=False, num_workers=4)
    net = Net()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9, weight_decay=0.0001)
    for epoch in range(20):
        correct = 0
        total = 0
        for data in trainloader:
            inputs, labels = data
            outputs = net(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
        acc = 100. * correct / total
This network gets to ~50% test accuracy with the parameters specified, after 20 epochs.
Note that I didn't do any whitening of the inputs (no per channel mean subtraction)
Next I scaled up the model inputs by 255, by replacing outputs = net(inputs) with outputs = net(inputs*255). After this change, the network no longer converges. I looked at the gradients and they seem to grow explosively after just a few iterations, leading to all model outputs being zero. I'd like to understand why this is happening.
Also, I tried scaling down the learning rate by 255. This helps, but the network only gets to ~43% accuracy. Again, I don't understand why this helps, and more importantly why the accuracy is still degraded compared to the original settings.
EDIT: forgot to mention that I don't use biases in this network.
EDIT2: I can recover the original accuracy if I scale down the initial weights in both layers by 255 (in addition to scaling down the learning rate). I also tried to scale down the initial weights only in the first layer, but the network had trouble learning (even when I did scale down the learning rate in both layers). Then I tried scaling down the learning rate only in the first layer - this also didn't help. Finally I tried reducing the learning rate in both layers even more (by 255*255) and this suddenly worked. This does not make sense to me - scaling down the initial weights by the same factor the inputs have been scaled up by should have completely eliminated any difference from the original network; the input to the second layer is identical. At that point the learning rate should need to be scaled down in the first layer only, but in practice both layers need a significantly lower learning rate...
Scaling up the inputs will lead to exploding gradients because of a few observations:
The learning rate is common to all the weights in a given update step.
Hence, the same scaling factor (i.e. the learning rate) is applied to a given weight's cost derivative regardless of its magnitude, so large and small weights get updated on the same scale.
When the loss landscape is highly erratic, this leads to exploding gradients (like a snowball effect: one overshot update - in, say, the axis of one particular weight - causes another in the opposite direction in the next update, which overshoots again, and so on).
The range of pixel values is 0 to 255, so dividing the data by 255 ensures all inputs are between 0 and 1, which gives smoother convergence because the gradients are more uniform with respect to the learning rate. Here you instead scaled the learning rate, which compensates for some of the problems mentioned above but is not as effective as scaling the data itself. Reducing the learning rate also makes convergence take longer; that might be why it only reaches 43% at 20 epochs - maybe it needs more epochs.
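For illustration, a minimal sketch of normalizing the data in the transform instead of touching the learning rate (the per-channel statistics are the commonly quoted CIFAR-10 values, not taken from the question):
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),  # already maps pixel values into [0, 1]
    # per-channel whitening; mean/std are commonly quoted CIFAR-10 statistics
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])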
Also:
CIFAR-10 is a significant step up from something like the MNIST dataset, so fully connected neural networks do not have the representational power needed to accurately classify these images. CNNs are the way to go for any image classification task beyond MNIST; roughly 50% accuracy is about the best you can expect from a plain fully connected network here, unfortunately.
Maybe decrease the learning rate by 1/255 ... just a guess

Higher loss penalty for true non-zero predictions

I am building a deep regression network (CNN) to predict a (1000,1) target vector from images (7,11). The target usually consists of about 90% zeros and only 10% non-zero values. The distribution of (non-)zero values in the targets varies from sample to sample (i.e. there is no global class imbalance).
Using mean squared error loss, this led to the network predicting only zeros, which I don't find surprising.
My best guess is to write a custom loss function that penalizes errors on non-zero values more than errors on zero values.
I have tried the loss function below with the intent of implementing that. It is a mean squared error loss in which errors on zero targets are penalized less (w=0.1).
import tensorflow as tf
from keras import backend as K

def my_loss(y_true, y_pred):
    # weights errors on zero targets less than errors on nonzero targets
    w = 0.1
    y_pred_of_nonzeros = tf.where(tf.equal(y_true, 0), y_pred - y_pred, y_pred)
    return K.mean(K.square(y_true - y_pred_of_nonzeros)) + K.mean(K.square(y_true - y_pred)) * w
The network is able to learn without getting stuck with only-zero predictions. However, this solution seems quite unclean. Is there a better way to deal with this type of problem? Any advice on improving the custom loss function?
Any suggestions are welcome, thank you in advance!
Best,
Lukas
Not sure there is anything better than a custom loss like the one you made, but there is a cleaner way:
def weightedLoss(w):
    def loss(true, pred):
        error = K.square(true - pred)
        error = K.switch(K.equal(true, 0), w * error, error)
        return error
    return loss
You may also return K.mean(error), but without the mean you can still benefit from other Keras options, like adding sample weights.
Select the weight when compiling:
model.compile(loss = weightedLoss(0.1), ...)
If you have the entire data in an array, you can do:
w = K.mean(y_train)
w = w / (1 - w)  # compensates the imbalance: the ~90% zeros now contribute as much as the ~10% ones
Another solution that can avoid using a custom loss, but requires changes in the data and the model is:
Transform your y into a 2-class problem for each output. Shape = (batch, originalClasses, 2).
For the zero values, make the first of the two classes = 1
For the one values, make the second of the two classes = 1
newY = np.stack([1-oldY, oldY], axis=-1)
Adjust the model to output this new shape.
...
model.add(Dense(2*classes))
model.add(Reshape((classes,2)))
model.add(Activation('softmax'))
Make sure you are using a softmax and a categorical_crossentropy as loss.
Then use the argument class_weight={0: w, 1: 1} in fit.
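Putting it together (a sketch; X_train, oldY and the surrounding model are hypothetical names for your data and network):
import numpy as np

newY = np.stack([1 - oldY, oldY], axis=-1)  # shape (samples, classes, 2), as described above
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X_train, newY, class_weight={0: w, 1: 1.0})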

How to adjust Model for rare binary outcome with Tensorflow or GBM

I'm currently working on data with a rare binary outcome, i.e. the response vector contains mostly 0s and only a few 1s (approximately 1.5% ones). I've got about 20 continuous explanatory variables. I tried to train models using GBM, random forests, and TensorFlow with the Keras backend.
I observed a special behavior of the models, regardless of which method I used:
The accuracy is high (~98%), but the model predicts a probability of ~98.5% for class "0" and ~1.5% for class "1" for all observations.
How can I prevent this behavior?
I'm using RStudio. For example, a TF model with Keras would be:
model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = "relu", input_shape = c(20)) %>%
  layer_dense(units = 256, activation = "relu") %>%
  layer_dense(units = 2, activation = "sigmoid")

parallel_model <- multi_gpu_model(model, gpus = 2)
parallel_model %>% compile(
  optimizer = "adam",
  loss = "binary_crossentropy",
  metrics = "binary_accuracy")

history <- parallel_model %>% fit(
  x_train, y_train,
  batch_size = 64,
  epochs = 100,
  class_weight = list("0" = 1, "1" = 70),
  verbose = 1,
  validation_split = 0.2
)
But my observation is not limited to TF, which makes my question more general. I'm not asking for specific adjustments to the model above; rather, I'd like to discuss why all outcomes end up assigned the same probability.
I can guess the issue is connected to the loss function.
I know there is no way to use AUC as a loss function, since it's not differentiable. If I evaluate the models by AUC on unseen data, the result is no better than random guessing.
I don't mind answers with code in Python, since this is a problem about general behavior and algorithms rather than coding.
When your problem has unbalanced classes, I suggest using SMOTE (on the training data only!!! never use SMOTE on your testing data!!!) before training the model.
For example:
from imblearn.over_sampling import SMOTE
X_trn_balanced, Y_trn_balanced = SMOTE(random_state=1, ratio=1).fit_sample(X_trn, Y_trn)
#next fit the model with the balanced data
model.fit(X_trn_balanced, Y_trn_balanced )
In my (not so big) experience with AUC problems and rare positives, I prefer models with a single output (not two): either "result is positive (1)" or "result is negative (0)".
Metrics like accuracy are useless for these problems; you should use AUC-based metrics with big batch sizes.
For these problems, it doesn't matter whether the output probabilities are very small, as long as there is a difference between them. (Forests, GBM, etc. will indeed output these small values, but this is not a problem.)
For neural networks, you can try to use class weights to increase the output probabilities. But notice that if you split the result into two separate classes (considering only one class should be positive), it doesn't matter if you use weights, because:
For the first class, low weights: predicting all ones is good.
For the second class, high weights: predicting all zeros is good (weighted to very good).
So, as an initial solution, you can:
Use a 'softmax' activation (to guarantee your model will have only one correct output) and a 'categorical_crossentropy' loss.
(Or, preferably) Use a model with only one output and keep 'sigmoid' with 'binary_crossentropy'.
I always work with the preferable option above. In this case, if you use batch sizes that are big enough to contain one or two positive examples (batch size around 100 for you), weights may even be discarded. If the batch sizes are too small and many batches don't contain positive results, you may have too many weight updates towards plain zeros, which is bad.
You may also resample your data and, for instance, multiply by 10 the number of positive examples, so your batches contain more positives and training becomes easier.
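A minimal sketch of that resampling idea (assuming numpy arrays X_train and y_train; the 10x factor is the one suggested above):
import numpy as np

pos = np.where(y_train == 1)[0]  # indices of positive examples
X_res = np.concatenate([X_train, np.repeat(X_train[pos], 9, axis=0)])
y_res = np.concatenate([y_train, np.repeat(y_train[pos], 9, axis=0)])
# each positive now appears 10 times in total; shuffle before training
perm = np.random.permutation(len(y_res))
X_res, y_res = X_res[perm], y_res[perm]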
Example of an AUC metric to determine when training should end:
# in python - considering outputs with only one class
import tensorflow as tf
from keras import backend as K

def aucMetric(true, pred):
    true = K.flatten(true)
    pred = K.flatten(pred)
    totalCount = K.shape(true)[0]
    values, indices = tf.nn.top_k(pred, k=totalCount)  # sort predictions descending
    sortedTrue = K.gather(true, indices)
    tpCurve = K.cumsum(sortedTrue)
    negatives = 1 - sortedTrue
    auc = K.sum(tpCurve * negatives)
    totalCount = K.cast(totalCount, K.floatx())
    positiveCount = K.sum(true)
    negativeCount = totalCount - positiveCount
    totalArea = positiveCount * negativeCount
    return auc / totalArea
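To use it for early stopping, a sketch (assuming a compiled Keras model; the monitor name follows Keras's convention of prefixing validation metrics with val_):
from keras.callbacks import EarlyStopping

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[aucMetric])
stop = EarlyStopping(monitor='val_aucMetric', mode='max', patience=10)
model.fit(x_train, y_train, validation_split=0.2, epochs=100, callbacks=[stop])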

How can I train for high specificity for a single class in Tensorflow?

I have a GRU network that receives a sequence of data, and labels it as class 0 or class 1. I want the model to have a high specificity for class 0 (being at least >= 0.8), while making sure that it still has good sensitivity (hopefully up to nearly 0.5).
How can I do this in Tensorflow? Is there a way to let the loss be determined by the combination of specificity and sensitivity for a single class? I don't really care about the accuracy of predictions for class 1, but, in this case, it's really important that the predictions are right when class 0 is predicted, while still having a fair number of predictions for class 0 (at least predicting class 0 half of the time that the labels would be for class 0 as well).
Any help would be greatly appreciated.
Here is the relevant code I have so far:
states_concat = tf.concat(axis=1, values=states)
logits = tf.layers.dense(states_concat, n_outputs)
softmax = tf.nn.softmax(logits=logits)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(xentropy)
#Set up the optimizer with gradient clipping to minimize the chance of exploding values
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9, use_nesterov=True)
I answered my own question.
For anyone else who runs into this, other useful terms are precision and recall: precision measures truePositives / (truePositives + falsePositives), and recall measures the sensitivity, truePositives / (truePositives + falseNegatives).
The links that helped me are:
https://www.quora.com/How-do-I-penalize-false-positive-in-deep-learning-tensorflow
https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
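Building on those links, a sketch of per-class weighting in the question's TF1 setup (the weight values are made-up starting points to tune, not from the original question):
# hypothetical weights: errors on class-1 labels count 5x, which discourages
# predicting class 0 for true class-1 samples (raising class-0 precision)
class_weights = tf.constant([1.0, 5.0])
weights = tf.gather(class_weights, y)  # one weight per sample, by its label
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits)
loss = tf.reduce_mean(weights * xentropy)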

weighting true positives vs true negatives

This TensorFlow loss function, tf.nn.weighted_cross_entropy_with_logits, is used in Keras/TensorFlow to weight binary decisions.
It weights false positives vs false negatives:
targets * -log(sigmoid(logits)) +
(1 - targets) * -log(1 - sigmoid(logits))
The argument pos_weight is used as a multiplier for the positive targets:
targets * -log(sigmoid(logits)) * pos_weight +
(1 - targets) * -log(1 - sigmoid(logits))
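For concreteness, a minimal sketch of calling it directly (TF2 signature; in TF1 the first argument was named targets, and the example values are made up):
import tensorflow as tf

labels = tf.constant([1.0, 0.0, 1.0])
logits = tf.constant([2.0, -1.0, 0.5])
# positives count twice as much as negatives in the loss
loss = tf.nn.weighted_cross_entropy_with_logits(labels=labels, logits=logits, pos_weight=2.0)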
Does anybody have any suggestions for how, in addition, true positives could be weighted against true negatives, if their loss/reward should not have equal weight?
First, note that with cross entropy loss there is some (possibly very, very small) penalty for each example, even if it is correctly classified. For example, if the correct class is 1 and our logit is 10, the penalty will be
-log(sigmoid(10)) ≈ 4.5e-5
This loss (very slightly) pushes the network to produce an even higher logit for this case, to get its sigmoid even closer to 1. Similarly, for the negative class, even if the logit is -10, the loss will push it to be even more negative.
This is usually fine because the loss from such terms is very small. If you would like your network to actually achieve zero loss, you can use label_smoothing. This is probably as close to "rewarding" the network as you can get in the classic setup of minimizing loss (you can obviously "reward" the network by adding some negative number to the loss. That won't change the gradient and training behavior though).
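A sketch of the label smoothing option in Keras (the 0.1 value is an arbitrary example):
import tensorflow as tf

# targets become 0.05 / 0.95 instead of 0 / 1, so the optimum is a finite logit
loss_fn = tf.keras.losses.BinaryCrossentropy(from_logits=True, label_smoothing=0.1)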
Having said that, you can penalize the network differently for various cases - tp, tn, fp, fn - similarly to what is described in Weight samples if incorrect guessed in binary cross entropy. (It seems like the implementation there is actually incorrect. You want to use corresponding elements of the weight_tensor to weight individual log(sigmoid(...)) terms, not the final output of cross_entropy).
Using this scheme, you might want to penalize very wrong answers much more than almost right answers. However, note that this is already happening to a degree because of the shape of log(sigmoid(...)).
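A sketch of that four-way weighting scheme (all weight values are hypothetical; the 0.5 threshold only selects which constant weight applies, so the loss stays differentiable through the log terms):
import tensorflow as tf

def four_way_weighted_bce(y_true, logits, w_tp=1.0, w_tn=1.0, w_fp=4.0, w_fn=1.0):
    p = tf.sigmoid(logits)
    pos_term = -tf.math.log(p + 1e-7)      # applies to positive targets
    neg_term = -tf.math.log(1 - p + 1e-7)  # applies to negative targets
    pred_pos = tf.cast(p >= 0.5, p.dtype)  # constant gate: which weight to use
    w_pos = w_tp * pred_pos + w_fn * (1 - pred_pos)  # weight when y_true == 1
    w_neg = w_fp * pred_pos + w_tn * (1 - pred_pos)  # weight when y_true == 0
    loss = y_true * w_pos * pos_term + (1 - y_true) * w_neg * neg_term
    return tf.reduce_mean(loss)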