Feature rescaling for k-means clustering - k-means

Is it correct to only rescale a large int feature and leave the % features as is when running k-means algorithm for segmentation analysis ?
e.g. a feature is the population of the city(150 000), then the rest features are % (e.g. 0.46) of energy consumption etc etc.

Related

Loss function for binary classification with problem of data imbalance

I try to segment of multiple sclerosis lesions in MR images using deep convolutional neural networks with keras. In this task, each voxel must be classified, either as a lesion voxel or healthy voxel.
The challenge of this task is data imbalance that number of lesion voxels is less than number of healthy voxels and data is extremely imbalanced.
I have a small number of training data and I can not use the sampling techniques. I try to select appropriate loss function to classify voxels in these images.
I tested focal loss, but I could not tuning gamma parameter in this loss function.
Maybe someone help me that how to select appropriate loss function for this task?
Focal loss is indeed a good choice, and it is difficult to tune it to work.
I would recommend using online hard negative mining: At each iteration, after your forward pass, you have loss computed per voxel. Before you compute gradients, sort the "healthy" voxels by their loss (high to low), and set to zero the loss for all healthy voxels apart from the worse k (where k is about 3 times the number of "lesion" voxels in the batch).
This way, gradients will only be estimated for a roughly balanced set.
This video provides a detailed explanation how class imbalance negatively affect training, and how to use online hard negative mining to overcome it.

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I got a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or number of parameters in the network, which I assume the KL loss increase with the complexity of the model.
I am trying to implement a neural network with the KL loss, without using keras.model.losses, as some software production and hardware support limitation. I am trying to train my model with TF 1.10 and TFP 0.3.0., the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper eq. 8, they refer to M being the number of mini-batches. To be consistent with the non-stochastic gradient learning, it should be scaled by the number of mini-batches which is what is done by Graves. Another alternative is that done in eq. 9, where they scale it by \pi_i, where the sum of all the values in the set {\pi} sum to one.
In the TFP example, it does look like the num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. This is goes by a few names, such as Safe Bayes or Tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and it's suitability.
As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found. (and a full mean-field approach where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (assume each parameter is independent), can write the joint distribution as a product. This means when you take the log when you are computing the KL between the approx. posterior and the prior, you can write it as a sum of the KL terms between each parameter. Since the KL is >= 0, for each parameter you add to your model you will be adding another positive term to your ELBO. This is likely why your loss is so much more for your 3D model, likely because there is more parameters.
Another reason this could occur is if you have less data (your M is smaller, than the KL term is weighted less).
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
I am unsure of any specific guideline, for training you are interested primarily in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term but this feels a bit yucky for the Bayesian in me).
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, which I assume the KL loss increase with the complexity of the model.
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss, without using keras.model.losses, as some software production and hardware support limitation. I am trying to train my model with TF 1.10 and TFP 0.3.0., the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) Batchnorm is fine. For sampling methods such as MCMC, Batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on suitability for batch norm with sampling methods for approx. Bayesian inference.

What is the impact from choosing auc/error/logloss as eval_metric for XGBoost binary classification problems?

How does choosing auc, error, or logloss as the eval_metric for XGBoost impact its performance? Assume data are unbalanced. How does it impact accuracy, recall, and precision?
Choosing between different evaluation matrices doesn't directly impact the performance. Evaluation matrices are there for the user to evaluate his model. accuracy is another evaluation method, and so does precision-recall. On the other hand, Objective functions is what impacts all those evaluation matrices
For example, if one classifier is yielding a probability of 0.7 for label 1 and 0.3 for label 0, and a different classifier is yielding a probability of 0.9 for label 1 and 0.1 for label 0 you will have a different error between them, though both of them will classify the labels correctly.
Personally, most of the times, I use roc auc to evaluate a binary classification, and if I want to look deeper, I look at a confusion matrix.
When dealing with unbalanced data, one needs to know how much unbalanced, is it 30% - 70% ratio or 0.1% - 99.9% ratio? I've read an article talking about how precision recall is a better evaluation for highly unbalanced data.
Here some more reading material:
Handling highly imbalance classes and why Receiver Operating Characteristics Curve (ROC Curve) should not be used, and Precision/Recall curve should be preferred in highly imbalanced situations
ROC and precision-recall with imbalanced datasets
The only way evaluation metric can impact your model accuracy (or other different eval matrices) is when using early_stopping. early_stopping decides when to stop train additional boosters according to your evaluation metric.
early_stopping was designed to prevent over-fitting.

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that is able to evaluate a model on lfw dataset. As a process, I am reading pair of images (using the LFW annotation list), track and crop the face, align it and pass it through a pre-trained facenet model (.pb using tensorflow) and extract the features. The feature vector size = (1,128) and the input image is (160,160).
To evaluate for the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) from two identical models ([2 x facenet] , this is equivalent like passing a batch of images with size 2 from a single network) and calculating the euclidean distance of the embeddings. Finally, I am training a linear SVM classifier to extract 0 when the embedding distance is small and 1 otherwise using pair labels. This way I am trying to learn a threshold to be used while testing.
Using this architecture I am getting a score of 60% maximum. On the other hand, using the same architecture on other models (e.g vgg-face), where the features are 4096 [fc7:0] (not embeddings) I am getting 90%. I definitely cannot replicate the scores that I see online (99.x%), but using the embeddings the score is very low. Is there something wrong with the pipeline in general ?? How can I evaluate the embeddings for verification?
Nevermind, the approach is correct, facenet model that is available online is poorly trained and that is the reason for the poor score. Since this model is trained on another dataset and not the original one that is described in the paper (obviously), verification score will be less than expected. However, if you set a constant threshold to the desired value you can probably increase true positives but by sacrificing f1 score.
You can use a similarity search engine. Either using approximated kNN search libraries such as Faiss or Nmslib, cloud-ready similarity search open-source tools such as Milvus, or production-ready managed service such as Pinecone.io.

Multi GPU architecture, gradient averaging - less accurate model?

When I execute the cifar10 model as described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approx 4 hours using a single GPU , when I utilize 2 GPU's the accuracy drops to 84% but reaching 84% accuracy is faster on 2 GPU's than 1.
My intuition is
that average_gradients function as defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value as an average of gradients will be less accurate than the actual gradient value.
If the gradients are less accurate then the parameters than control the function that is learned as part of training is less accurate. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py) why is averaging the gradients over multiple GPU's less accurate than computing the gradient on a single GPU ?
Is my intuition of averaging the gradients producing a less accurate value correct ?
Randomness in the model is described as :
The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:
Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.
src : https://www.tensorflow.org/tutorials/deep_cnn
Does this have an effect on training accuracy ?
Update :
Attempting to investigate this further, the loss function value training with different number of GPU's.
Training with 1 GPU : loss value : .7 , Accuracy : 86%
Training with 2 GPU's : loss value : .5 , Accuracy : 84%
Shouldn't the loss value be lower for higher for higher accuracy, not vice versa ?
In the code you linked, using the function average_gradient with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.
You can see it in the definition:
grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
Using a larger batch size (given the same number of epochs) can have any kind of effect on your results.
Therefore, if you want to do exactly equivalent (1) calculations in 1-GPU or 2-GPU cases, you may want to halve the batch size in the latter case. (People sometimes avoid doing it, because smaller batch sizes may also make the computation on each GPU slower, in some cases)
Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like
print sess.run(lr)
should work here.
(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
There is a decent discussion of this here (not my content). Basically when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade off.
[Zhang et. al., 2015] proposes one method for distributed SGD called elastic-averaged SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and assuming accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate to a higher accuracy. The emphasis is important.
If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and then tests against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for 1 GPU case should be lower since its accuracy is lower".
Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.
Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.
I've also encountered this issue.
See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour from Facebook which addresses the same issue. The suggested solution is simply to scale up the learning rate by k (after some reasonable warm-up epochs) for k GPUs.
In practice I've found out that simply summing up the gradients from the GPUs (rather than averaging them) and using the original learning rate sometimes does the job as well.