Does boosting increase weights of low performing/high error trees or of high performing/low error trees? - xgboost

Does boosting increase the weights of low-performing/high-error trees or the weights of high-performing/low-error trees?
I keep seeing both definitions online and I'm not sure which one is correct.

Related

Loss function for binary classification with a data imbalance problem

I am trying to segment multiple sclerosis lesions in MR images using deep convolutional neural networks with Keras. In this task, each voxel must be classified as either a lesion voxel or a healthy voxel.
The challenge of this task is data imbalance: the number of lesion voxels is far smaller than the number of healthy voxels, so the data is extremely imbalanced.
I have a small amount of training data and cannot use sampling techniques. I am trying to select an appropriate loss function to classify the voxels in these images.
I tested focal loss, but I could not tune the gamma parameter of this loss function.
Could someone help me select an appropriate loss function for this task?
Focal loss is indeed a good choice, but it can be difficult to tune it to work well.
I would recommend using online hard negative mining: at each iteration, after your forward pass, you have the loss computed per voxel. Before you compute gradients, sort the "healthy" voxels by their loss (high to low), and set the loss to zero for all healthy voxels apart from the worst k (where k is about 3 times the number of "lesion" voxels in the batch).
This way, gradients will only be estimated on a roughly balanced set.
This video provides a detailed explanation of how class imbalance negatively affects training, and how to use online hard negative mining to overcome it.
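For concreteness, here is a minimal NumPy sketch of the masking step described above (not taken from the video); the function name, argument names, and the 3:1 ratio default are illustrative. In a Keras/TensorFlow loss you would do the same thing with tensor ops (e.g. tf.math.top_k) instead of NumPy.

import numpy as np

def hard_negative_mask(per_voxel_loss, labels, neg_pos_ratio=3):
    """Keep the loss for all lesion voxels and only for the hardest
    (highest-loss) healthy voxels; zero out the rest.

    per_voxel_loss : 1-D array of per-voxel losses for one batch
    labels         : 1-D array, 1 = lesion, 0 = healthy
    """
    pos = labels == 1
    k = int(neg_pos_ratio * pos.sum())            # ~3 hard negatives per positive
    # push positives to -inf so they are never picked as "hard negatives"
    neg_losses = np.where(pos, -np.inf, per_voxel_loss)
    mask = pos.copy()
    if k > 0:
        hardest = np.argsort(neg_losses)[-k:]     # indices of the k hardest negatives
        mask[hardest] = True
    return mask  # multiply the per-voxel loss by this mask before reducing it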

Expected accuracy of the pet example using SSD model in TensorFlow object detection API?

I'm using the default pipeline configuration (ssd_inception_v2_pets.config) and the pretrained Inception v2 COCO model. In TensorBoard, the loss keeps decreasing, but the average precision isn't getting any better. Has anyone done a similar experiment using Inception v2 with SSD? What was your experience?
The low mAP is due to an extremely low score threshold at the non-maximum suppression step. With such a low threshold, almost every image produces more than 70 detections, while there is only one ground truth per image. Changing this threshold to a more reasonable value (0.1) produces a much better mAP plot.
You may have been bitten by this issue, as I was: https://github.com/tensorflow/models/issues/2749
Try updating and grabbing one of the new pretrained starting checkpoints, and see if your problem is resolved.

Tensorflow: When to stop training due to the best (minimum) cost?

I am using TensorFlow to classify images with the LeNet network, and AdamOptimizer to minimize the cost function. When I train the model, I can see that the training accuracy, validation accuracy, and cost all fluctuate, sometimes decreasing and sometimes increasing.
My questions: When should we stop training? How can we know that the optimizer has found the minimum cost? For how many iterations should we train? Can we set a variable or condition to stop at the minimum cost?
My solution is to define a global variable (min_cost), and in each iteration check whether the cost has decreased; if so, save the session and replace min_cost with the new cost. At the end, I will have the saved session for the minimum cost.
Is this a correct approach?
Thanks in advance.
When training neural networks, a target error is usually defined along with a maximum number of iterations. For example, the target error could be an MSE of 0.001. Once this error has been reached, the training stops; if it has not been reached after the maximum number of iterations, the training also stops.
But it seems like you want to train until you know the network can't do any better. Saving the 'best' parameters as you are doing is a fine approach, but realise that once some kind of minimum cost has been reached, the error won't fluctuate much anymore. It won't suddenly go up significantly, so it is not strictly necessary to save the network.
There is no such thing as a 'minimal cost' - the network is always trying to move towards some local minimum, and it will always keep doing so. There is no real way for you (or an algorithm) to figure out that no better error can be reached.
tl;dr - just set a reasonable target error along with a maximum number of iterations.
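To make both stopping criteria concrete, here is a minimal TensorFlow 1.x-style sketch of a training loop that saves the best checkpoint and stops on either a target error or a maximum number of iterations; train_op and loss are assumed to be defined elsewhere in the graph, and in practice you would usually monitor the validation cost rather than the training cost.

import numpy as np
import tensorflow as tf

target_error = 1e-3        # illustrative target cost
max_iterations = 10000     # illustrative iteration cap
best_cost = np.inf

saver = tf.train.Saver()   # assumes the graph (train_op, loss, variables) is built
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(max_iterations):
        _, cost = sess.run([train_op, loss])            # one training step
        if cost < best_cost:                            # new best cost seen
            best_cost = cost
            saver.save(sess, "checkpoints/best_model")  # keep the best parameters
        if best_cost <= target_error:                   # target error reached
            break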

Multi GPU architecture, gradient averaging - less accurate model?

When I run the cifar10 model described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approximately 4 hours using a single GPU. When I use 2 GPUs, the accuracy drops to 84%, but reaching 84% accuracy is faster on 2 GPUs than on 1.
My intuition is
that the average_gradients function defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value, because an average of gradients will be less accurate than the actual gradient value.
If the gradients are less accurate, then the parameters that control the function learned during training are less accurate. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why would averaging the gradients over multiple GPUs be less accurate than computing the gradient on a single GPU?
Is my intuition that averaging the gradients produces a less accurate value correct?
Randomness in the model is described as:
The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:
Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.
Source: https://www.tensorflow.org/tutorials/deep_cnn
Does this have an effect on training accuracy ?
Update:
To investigate this further, here are the loss values when training with different numbers of GPUs:
Training with 1 GPU: loss value: 0.7, accuracy: 86%
Training with 2 GPUs: loss value: 0.5, accuracy: 84%
Shouldn't the loss value be lower for higher accuracy, not vice versa?
In the code you linked, using the average_gradients function with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.
You can see it in the definition:
grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
Using a larger batch size (given the same number of epochs) can have any kind of effect on your results.
Therefore, if you want to do exactly equivalent (1) calculations in the 1-GPU and 2-GPU cases, you may want to halve the per-GPU batch size in the latter case. (People sometimes avoid doing this, because smaller batch sizes may also make the computation on each GPU slower in some cases.)
Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both the 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs; something like
print sess.run(lr)
should work here.
(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
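As a sanity check of the equivalence (1) above, here is a tiny NumPy snippet (not from the CIFAR-10 code) showing that, for a loss defined as a mean over examples, averaging the two half-batch gradients gives exactly the gradient of one batch of twice the size; linear-regression MSE is used purely as an illustration.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def grad(Xb, yb, w):
    # gradient of 0.5 * mean((Xb @ w - yb) ** 2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

g_two_gpus = 0.5 * (grad(X[:4], y[:4], w) + grad(X[4:], y[4:], w))  # averaged halves
g_one_big  = grad(X, y, w)                                           # one double batch
print(np.allclose(g_two_gpus, g_one_big))  # True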
There is a decent discussion of this here (not my content). Basically when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade off.
[Zhang et al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and that accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate with a higher accuracy. The emphasis on "the same dataset" is important.
If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and then tests against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is lower".
Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.
Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.
I've also encountered this issue.
See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour from Facebook, which addresses the same issue. The suggested solution is simply to scale up the learning rate by k (after some reasonable warm-up epochs) for k GPUs.
In practice I've found that simply summing the gradients from the GPUs (rather than averaging them) and using the original learning rate sometimes does the job as well.
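As a rough illustration of that linear scaling rule with warm-up, here is a hypothetical learning-rate schedule; the function name and default values are made up for the example and are not part of the paper or the CIFAR-10 code.

def scaled_learning_rate(step, base_lr=0.1, num_gpus=2, warmup_steps=500):
    target_lr = base_lr * num_gpus  # linear scaling rule: lr' = k * lr for k GPUs
    if step < warmup_steps:
        # ramp linearly from the base rate up to the scaled rate during warm-up
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr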

How to interpret increase in both loss and accuracy

I have run deep learning models (CNNs) using TensorFlow. Many times during an epoch, I have observed both loss and accuracy increasing, or both decreasing. My understanding was that the two are always inversely related. In what scenario could both increase or decrease simultaneously?
The loss decreases as the training process goes on, except for some fluctuation introduced by mini-batch gradient descent and/or regularization techniques like dropout (which introduces random noise).
If the loss decreases, the training process is going well.
The (validation, I suppose) accuracy, instead, is a measure of how good your model's predictions are.
If the model is learning, the accuracy increases. If the model is overfitting, instead, the accuracy stops increasing and can even start to decrease.
If the loss decreases and the accuracy decreases, your model is overfitting.
If the loss increases and the accuracy increases too, it is because your regularization techniques are working well and you are fighting the overfitting problem. This is true only if the loss then starts to decrease while the accuracy continues to increase.
Otherwise, if the loss keeps growing, your model is diverging and you should look for the cause (usually a learning rate that is too high).
I think the top-rated answer is incorrect.
I will assume you are talking about cross-entropy loss, which can be thought of as a measure of 'surprise'.
Loss and accuracy increasing/decreasing simultaneously on the training data tells you nothing about whether your model is overfitting. This can only be determined by comparing loss/accuracy on the validation vs. training data.
If loss and accuracy are both decreasing, it means your model is becoming more confident in its correct predictions, or less confident in its incorrect predictions, or both; hence the decreased loss. However, it is also making more incorrect predictions overall, hence the drop in accuracy. Vice versa if both are increasing. That is all we can say.
I'd like to add a possible option here for all those who are struggling with model training right now.
If your validation data is a bit dirty, you might find that at the beginning of training the validation loss is low, as is the accuracy, and the more you train the network, the more the accuracy increases alongside the loss. This happens because the network finds the possible outliers in your dirty data and incurs a very high loss there. Therefore, your accuracy grows as it gets more data right, but the loss grows with it.
This is just what I think, based on the math behind the loss and the accuracy.
Note: I assume your data is categorical (one-hot encoded).
Your model's output (used to calculate the loss): [0.1, 0.9, 0.9009, 0.8]
Argmaxed output (used to calculate the accuracy): [0, 0, 1, 0]
Expected output: [0, 1, 0, 0]
Let's clarify what the loss and the accuracy measure:
Loss: the overall error between y and y_pred.
Accuracy: simply whether y and argmax(y_pred) are equal.
So overall our model almost nailed it, resulting in a low loss. But in the argmaxed output there is no notion of "almost": either the prediction matches the expected output completely (counted as 1) or it does not (counted as 0).
Thus we end up with a low accuracy too.
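Here is a small NumPy rendering of the example above; it treats the true-class score as the predicted probability, as the reasoning above implicitly does, so the numbers are only illustrative.

import numpy as np

y_true = np.array([0, 1, 0, 0])
y_pred = np.array([0.1, 0.9, 0.9009, 0.8])

loss = -np.log(y_pred[np.argmax(y_true)])                  # -log(0.9) ≈ 0.105, low loss
accuracy = float(np.argmax(y_pred) == np.argmax(y_true))   # 0.0: argmax is index 2, not 1
print(loss, accuracy)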
Try checking the MAE of the model.
Remove regularization.
Check whether you are using the correct loss.
You should check your class indices (in both the training and validation sets) during the training process; they might be sorted in different ways. I had this problem in Colab.