One-class classification - interpreting the model's accuracy - libsvm

I am using LIBSVM for classification of data. I am mainly doing One Class Classification.
My training set consists of data from only one class, and my testing data consists of data from two classes (one that belongs to the target class and one that does not).
After applying svmtrain and svmpredict to both the training and testing datasets, the accuracy on the training set is 48% and on the testing set it is 34.72%.
Is it good? How can I know whether LIBSVM is classifying the datasets correctly?

Whether this is good depends entirely on the data you are trying to classify. You should look up the state-of-the-art accuracy of SVM models on your kind of classification problem; then you will be able to tell whether your model is good or not.
What I can say from your results is that the testing accuracy is worse than the training accuracy, which is normal, as a classifier usually performs better on data it has already seen.
What you can try now is to play with the regularization parameter (C if you are using a standard SVM; for LIBSVM's one-class formulation the analogous parameter is nu) and see if the performance improves on the testing set.
You can also plot learning curves to see whether your classifier overfits, which will help you decide whether to increase or decrease the regularization.
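Below is a minimal sketch of such a parameter sweep (not the poster's MATLAB/LIBSVM calls), using scikit-learn's OneClassSVM, which wraps libsvm. X_train, X_test and y_test (+1 for the target class, -1 otherwise) are assumed names; ideally you would sweep on a separate validation split rather than the final test set.

```python
# Hedged sketch: sweep the one-class SVM parameters and check accuracy
# on a labelled held-out set. Names X_train, X_test, y_test are assumed.
import numpy as np
from sklearn.svm import OneClassSVM

best = None
for nu in [0.01, 0.05, 0.1, 0.2, 0.5]:        # fraction of training outliers tolerated
    for gamma in [0.001, 0.01, 0.1, 1.0]:     # RBF kernel width
        clf = OneClassSVM(kernel="rbf", nu=nu, gamma=gamma).fit(X_train)
        acc = np.mean(clf.predict(X_test) == y_test)   # predict returns +1 (inlier) / -1 (outlier)
        if best is None or acc > best[0]:
            best = (acc, nu, gamma)

print("best test accuracy %.3f with nu=%g, gamma=%g" % best)
```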
In your case, you might also want to apply class weighting, as the data is often heavily imbalanced in favor of negative examples.
To know whether LIBSVM is classifying the dataset correctly, you can look at which examples it predicted correctly and which ones it predicted incorrectly. Then you can try to change your features to improve its results.
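A small follow-up sketch: given the true labels and the predictions (predicted_labels could come from svmpredict or from the sweep above; the names are illustrative), list the misclassified test examples so you can inspect their features.

```python
import numpy as np

wrong = np.flatnonzero(np.asarray(predicted_labels) != np.asarray(y_test))
print("misclassified %d of %d test examples" % (len(wrong), len(y_test)))
print("indices of the first few errors:", wrong[:10])
```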
If you are worried about your code being correct, you can write a toy example and play with it, or take someone's example from the web and replicate their results.

Related

Can predictions be trusted if learning curve shows validation error lower than training error?

I'm working with neural networks (NN) as part of my thesis in geophysics, and am using TensorFlow with Keras to train my network.
My current task is to use a NN to approximate a thermodynamical model, i.e. a nonlinear regression problem. It takes 13 input parameters and outputs a velocity profile (velocity vs. depth) of 450 parameters. My data consists of 100,000 synthetic examples (i.e. no noise is present), split into training (80k), validation (10k) and testing (10k) sets.
I've tested my network for a number of different architectures: wider (5-800 neurons) and deeper (up to 10 layers), different learning rates and batch sizes, and even for many epochs (5000). Basically all the standard tricks of the trade...
But, I am puzzled by the fact that the learning curve shows validation error lower than training error (for all my tests), and I've never been able to overfit to the training data. See figure below:
The error on the test set is correspondingly low, thus the network seems to be able to make decent predictions. It seems like a single hidden layer of 50 neurons is sufficient. However, I'm not sure if I can trust these results due to the behavior of the learning curve. I've considered that this might be due to the validation set consisting of examples that are "easy" to predict, but I cannot see how I should change this. A bigger validation set perhaps?
To wrap it up: Is it necessarily a bad sign if the validation error is lower than, or very close to, the training error? What if the predictions made with said network are decent?
Is it possible that overfitting is simply not possible for my problem and data?
In addition to trying a higher k-fold and the additional testing holdout sample, perhaps mix it up when sampling from the original data set: select a stratified sample when partitioning out the training and validation/test sets, then partition the validation and test sets without stratifying the sampling.
My opinion is that if you introduce more variation in your modeling methodology (without breaking any "statistical rules"), you can be more confident in the model that you have created.
You can achieve more trustworthy results by repeating your experiments on different data. Use cross-validation with a high number of folds (e.g. k=10) to gain more confidence in your solution's performance. Neural networks usually overfit easily; if your solution has similar results on the validation and test sets, that is a good sign.
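A minimal sketch of such a k-fold estimate, assuming TensorFlow/Keras and NumPy arrays X of shape (n, 13) and y of shape (n, 450); the names and the tiny build_model helper are illustrative, not from the post.

```python
import numpy as np
from sklearn.model_selection import KFold
import tensorflow as tf

def build_model():
    # single hidden layer of 50 neurons, as the question suggests is sufficient
    return tf.keras.Sequential([
        tf.keras.layers.Dense(50, activation="relu", input_shape=(13,)),
        tf.keras.layers.Dense(450),
    ])

def cross_validate(X, y, k=10, epochs=100):
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    fold_scores = []
    for train_idx, val_idx in kf.split(X):
        model = build_model()
        model.compile(optimizer="adam", loss="mse")
        model.fit(X[train_idx], y[train_idx], epochs=epochs,
                  batch_size=256, verbose=0)
        fold_scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
    # A tight spread across folds supports trusting the low validation error;
    # a large spread suggests it was partly luck of the split.
    return np.mean(fold_scores), np.std(fold_scores)
```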
It is not easy to tell without knowing exactly how you have set up the experiment:
what cross-validation method did you use?
how did you split the data?
etc
As you mentioned, observing a validation error lower than the training error can result either from the training dataset containing many "hard" cases to learn, or from the validation set containing many "easy" cases to predict.
However, since the training loss is generally expected to be lower than the validation loss, the fit of this specific model looks unpredictable to me (performing better on unseen data than on the data it was trained on does feel odd).
To overcome this, I would start experimenting by reconsidering the data splitting strategy, adding more data if possible, or even changing your performance metric.

Is it possible to train a CNN on a dataset and test it on another dataset with different classes?

I am new to deep learning, and I am doing research using CNNs. I need to train a CNN model on a dataset of images (landmark images) and test the same model using a different dataset (also landmark images). One of the motivations is to see the ability of the model to generalize. But the problem is: since the datasets used for training and testing are not the same, the classes are not the same either, and possibly not even the number of classes, which means that the predictions made on the test dataset are not trustworthy (since the weights of the output layer have been learned for the classes of the training dataset). Is there any way to evaluate a model on a different dataset without affecting test accuracy?
The performance of a neural network on one dataset will not generally be the same as its performance on another. Images in one dataset can be more difficult to distinguish than those in another. As a rule of thumb: if your landmark datasets are similar, it's likely that performance will be similar. However, this is not always the case: subtle differences between the datasets can result in significantly different performance.
You can account for the potentially different performance on the two datasets by training another network on the other dataset. This will give you a baseline of what to expect when you try to generalize your network to it.
You can apply your neural network trained for one set of classes to another set of classes. There are two main approaches to this:
Transfer learning. This is where the last layer of your trained network is replaced with one or more new layers that are trained, by themselves, to classify the new images. (Use for many classes. Can use for few classes.)
All-Transfer learning. Rather than replacing the last layer, add a new layer after it and only train the final layers. (Use for few classes.)
Both approaches are much quicker than training a neural network from scratch.
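As a rough illustration of the first approach (replacing the classification head), here is a hedged Keras sketch; the ResNet50 backbone, image size and num_new_classes are stand-in assumptions, not details from the question.

```python
import tensorflow as tf

num_new_classes = 20                                 # hypothetical number of classes in the new dataset
base = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, pooling="avg",
    input_shape=(224, 224, 3))
base.trainable = False                               # keep the pretrained features frozen

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_new_classes, activation="softmax"),  # new head for the new classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(new_train_ds, validation_data=new_val_ds, epochs=5)    # hypothetical datasets
```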
I assume that you are facing a classification problem.
What exactly do you mean? Do you have classes A, B and C in your train dataset and the same classes in your test dataset with a different labeling, or do you have completely different classes in your test dataset with respect to your train dataset?
You can solve the first problem by creating a mapping from train labels to test labels, or vice versa.
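For instance, a tiny sketch of such a mapping; the label values and the predictions_on_test name are made up for illustration.

```python
# same concepts, different class indices per dataset
train_to_test_label = {0: 2, 1: 0, 2: 1}
mapped_predictions = [train_to_test_label[p] for p in predictions_on_test]
```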
The second case depends on what you are trying to achieve... If you want the model to predict classes it was never trained on, you won't get any meaningful output.

How to remove "glitches" in the loss graph of training phase?

When training a deep learning model, I noticed that the training loss was a little weird. There were some "glitches" at certain epochs, as seen in the figure below.
Could you let me know the reasons for these and how to get rid of them?
Thank you
This may be completely normal and due to how the learning process works.
In practice, since with stochastic gradient descent (SGD) you optimize the loss function by approximating the whole-dataset loss landscape with the current minibatch's loss landscape, the optimization process becomes noisy and spiky.
In fact, in each iteration you evaluate the loss obtained by the model on the current minibatch and then update the model parameters based on this loss. However, this loss value is not necessarily the value you would have obtained by making a prediction on the whole dataset. For example, in a binary classification problem, imagine what happens if, due to randomness, your current minibatch contains only samples of class A instead of both classes A and B: the current loss doesn't take class B into account and you will update the model parameters based only on the results for class A. As a consequence, if the next minibatch happens to contain an equal number of samples from classes A and B, your results will be worse than usual.
Even though class imbalance is usually addressed by using balanced minibatches or weighted loss functions, more generally what I described can also happen within a single class. Suppose there is a lot of heterogeneity within class A: your model could update the parameters more on the basis of certain features than others.
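As a concrete illustration of the weighted-loss option mentioned above, here is a hedged Keras sketch; model, x_train and y_train (integer class labels) are assumed to exist already and are not from the question.

```python
import numpy as np

# inverse-frequency class weights: rare classes contribute more to the loss
classes, counts = np.unique(y_train, return_counts=True)
class_weight = {int(c): len(y_train) / (len(classes) * n)
                for c, n in zip(classes, counts)}

model.fit(x_train, y_train, batch_size=64, epochs=10,
          class_weight=class_weight, shuffle=True)  # per-epoch shuffling also evens out the minibatches
```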
For more theoretical aspects, which I really encourage you to read, you can have a look at this:
http://ruder.io/optimizing-gradient-descent/

CNN : Fine tuning small network vs feature extracting from a big network

To elaborate: under what circumstances would fine-tuning all layers of a small network (say SqueezeNet) perform better than feature extraction from, or fine-tuning only the last 1 or 2 convolutional layers of, a big network (e.g. InceptionV4)?
My understanding is that the computing resources required for both are somewhat comparable. And I remember reading in a paper that the extreme options, i.e. fine-tuning 90% or 10% of the network, are far better than more moderate ones like 50%. So, what should be the default choice when experimenting extensively is not an option?
Any past experiments and intuitive descriptions of their results, research papers or blogs would be especially helpful. Thanks.
I don't have much experience in training models like SqueezeNet, but I think it is much easier to finetune only the last 1 or 2 layers of a big network: you don't have to extensively search for many optimal hyperparameters. Transfer learning works amazingly well out of the box with the LR finder and the cyclical learning rate from fast.ai.
If you want fast inference after the training, then it is preferable to train SqueezeNet. It might also be the case if the new task is very different from ImageNet.
Some intuition from http://cs231n.github.io/transfer-learning/
New dataset is small and similar to original dataset. Since the data is small, it is not a good idea to fine-tune the ConvNet due to overfitting concerns. Since the data is similar to the original data, we expect higher-level features in the ConvNet to be relevant to this dataset as well. Hence, the best idea might be to train a linear classifier on the CNN codes.
New dataset is large and similar to the original dataset. Since we have more data, we can have more confidence that we won’t overfit if we were to try to fine-tune through the full network.
New dataset is small but very different from the original dataset. Since the data is small, it is likely best to only train a linear classifier. Since the dataset is very different, it might not be best to train the classifier from the top of the network, which contains more dataset-specific features. Instead, it might work better to train the SVM classifier from activations somewhere earlier in the network.
New dataset is large and very different from the original dataset. Since the dataset is very large, we may expect that we can afford to train a ConvNet from scratch. However, in practice it is very often still beneficial to initialize with weights from a pretrained model. In this case, we would have enough data and confidence to fine-tune through the entire network.
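To make the two options concrete, here is a hedged Keras sketch of feature extraction versus partial fine-tuning; MobileNetV2 stands in for the backbone (neither SqueezeNet nor InceptionV4 ships with keras.applications), and num_classes and the layer count are illustrative.

```python
import tensorflow as tf

def build(num_classes, fine_tune_last_layers=0):
    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))
    if fine_tune_last_layers > 0:
        base.trainable = True                          # fine-tuning: unfreeze the backbone ...
        for layer in base.layers[:-fine_tune_last_layers]:
            layer.trainable = False                    # ... but keep the lower layers frozen
    else:
        base.trainable = False                         # pure feature extraction ("CNN codes")
    return tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(num_classes, activation="softmax"),  # new linear classifier
    ])

feature_extractor = build(num_classes=10)                           # train only the new head
partially_tuned = build(num_classes=10, fine_tune_last_layers=30)   # also adapt the top conv blocks
```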

Must each tensorflow batch contain a uniform distribution of the inputs for all expected classifications?

This is probably a newbie question but I'm trying to get my head around how training on small batches works.
Scenario -
For the MNIST classification problem, let's say that we have a model with appropriate hyperparameters that allow training on the digits 0-9. If we feed it small batches with a uniform distribution of inputs (more or less the same number of each digit in every batch), it'll learn to classify as expected.
Now, imagine that instead of a uniform distribution, we trained the model on images containing only 1s so that the weights are adjusted until it works perfectly for 1s. And then we start training on images that contain only 2s. Note that only the inputs have changed, the model and everything else has stayed the same.
Question -
What does the training exclusively on 2s after the model was already trained exclusively on 1s do? Will it keep adjusting the weights till it has forgotten (so to say) all about 1s and is now classifying on 2s? Or will it still adjust the weights in a way that it remembers both 1s and 2s?
In other words, must each batch contain a uniform distribution of different classifications? Does retraining a trained model in Tensorflow overwrite previous trainings? If yes, if it is not possible to create small (< 256) batches that are sufficiently uniform, does it make sense to train on very large (>= 500-2000) batch sizes?
That is a good question without a clear answer. In general, the order and selection of training samples has a large impact on the performance of the trained net, in particular in respect to the generalization properties it shows.
The impact is so strong, actually, that selecting specific examples and ordering them in a particular way to maximize the performance of the net even constitutes a genuine research area called "curriculum learning". See this research paper.
So back to your specific question: you should try the different possibilities and evaluate each of them (which might actually be an interesting learning exercise anyway). I would expect uniformly distributed samples to generalize well over the different categories, and samples drawn from the original distribution to achieve the highest overall score (since, if 90% of your samples come from one category A, a model getting 70% in every category will have lower total accuracy than one getting 99% on A and 0% everywhere else); other sample selection mechanisms will show different behavior.
An interesting read on such questions is Bengio's 2012 paper Practical Recommendations for Gradient-Based Training of Deep Architectures.
There is a section about online learning where the distribution of training data is unknown. I quote from the original paper:
"It means that online learners, when given a stream of non-repetitive training data, really optimize (maybe not in the optimal way, i.e., using a first-order gradient technique) what we really care about: generalization error."
The best practice, though, to figure out how your dataset behaves under different testing scenarios would be to try both and obtain experimental results on how the distribution of the training data affects your generalization error.
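A rough sketch of that experiment on MNIST with TensorFlow/Keras, comparing class-sorted batches (all 0s, then all 1s, ...) against shuffled batches; the tiny architecture and single epoch are illustrative choices, not recommendations.

```python
import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

def run(order):
    model = make_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train[order], y_train[order], batch_size=128, epochs=1,
              shuffle=False, verbose=0)              # keep the chosen sample order intact
    return model.evaluate(x_test, y_test, verbose=0)[1]

sorted_order = np.argsort(y_train, kind="stable")    # class-by-class, like "all 1s then all 2s"
shuffled_order = np.random.permutation(len(y_train))
print("class-sorted batches :", run(sorted_order))
print("shuffled batches     :", run(shuffled_order))
```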