Testing Unsupervised Machine Learning Algorithms - testing

All over the Internet, I can see applications of supervised and unsupervised Machine Learning Algorithms but no one is talking about maintaining the quality of machine learning apps.
The recent analysis on how to test unsupervised machine learning algorithms brought up these points:
1) Cross-validation Testing: Dataset is divided into equal folds(parts) and all folds except the one are used as training dataset and later is used as test dataset Few more options around using test and training dataset.
Are there more effective ways of testing unsupervised ML algorithms where the output is uncertain?

Depending on the type of algorithm (and choosen distance) you used, you can still try to look if the variance between group and the variance within group is changing a lot.
If your algorithm is still as good as when you built it, the variance between and the variance within should not change that much. If the variance between shrink (or the reverse), it mean groups are not as well separated as before by your algorithm.
The second thing you can try is to keep some observations (that you know were well classified) to see if they are still in the same group once you retrained your algorithm. If not, it doesn't mean your algorithm is wrong, but you can send an alert in this case to look deeper.

Related

Deep learning basic thoughts

I try to understand the basics of deep learning, lastly reading a bit through deeplearning4j. However, I don't really find an answer for: How does the training performance scale with the amount of training data?
Apparently, the cost function always depends on all the training data, since it just sums the squared error per input. Thus, I guess at each optimization step, all datapoints have to be taken into account. I mean deeplearning4j has the dataset iterator and the INDArray, where the data can live anywhere and thus (I think) doesn't limit the amount of training data. Still, doesn't that mean, that the amount of training data is directly related to the calculation time per step within the gradient descend?
DL4J uses iterator. Keras uses generator. Still the same idea - your data comes in batches, and used for SGD. So, minibatches matter, not the the whole amount of data you have.
Fundamentally speaking it doesn't (though your mileage may vary). You must research right architecture for your problem. Adding new data records may introduce some new features, which may be hard to capture with your current architecture. I'd safely always question my net's capacity. Retrain your model and check if metrics drop.

Can predictions be trusted if learning curve shows validation error lower than training error?

I'm working with neural networks (NN) as a part of my thesis in geophysics, and is using TensorFlow with Keras for training my network.
My current task is to use a NN to approximate a thermodynamical model i.e a nonlinear regression problem. It takes 13 input parameters and outputs a velocity profile (velocity vs. depth) of 450 parameters. My data consists of 100,000 synthetic examples (i.e. no noise is present), split in training (80k), validation (10k) and testing (10k).
I've tested my network for a number of different architectures: wider (5-800 neurons) and deeper (up to 10 layers), different learning rates and batch sizes, and even for many epochs (5000). Basically all the standard tricks of the trade...
But, I am puzzled by the fact that the learning curve shows validation error lower than training error (for all my tests), and I've never been able to overfit to the training data. See figure below:
The error on the test set is correspondingly low, thus the network seems to be able to make decent predictions. It seems like a single hidden layer of 50 neurons is sufficient. However, I'm not sure if I can trust these results due to the behavior of the learning curve. I've considered that this might be due to the validation set consisting of examples that are "easy" to predict, but I cannot see how I should change this. A bigger validation set perhaps?
To wrap it up: Is is necessarily a bad sign if the validation error is lower than or very close to the training error? What if the predictions made with said network are decent?
Is it possible that overfitting is simply not possible for my problem and data?
In addition to trying a higher k fold and the additional testing holdout sample,perhaps mix it up when sampling from the original data set: Select a stratified sample when partitioning out the training and validation/test sets. Then partition the validation and test set without stratifying the sampling.
My opinion is that if you introduce more variation in your modeling methodology (without breaking any "statistical rules"), you can be more confident in the model that you have created.
You can achieve more trustworthy results by repeating your experiments on different data. Use cross validation with high fold (like k=10) to get better confidence of your solution performance. Usually neural networks easily overfit, if your solution has similar results on validation and test set its a good sign.
It is not that easy to tell when not knowing the exact way you have setup the experiment:
what cross-validation method did you use?
how did you split the data?
etc
As you mentioned, the fact that you observe validation error lower than training can be a result of the fact that either the training dataset contains many "hard" cases to learn or the validation set contains many "easy" cases to predict.
However, since generally speaking training loss is expected to underestimate the validation, to me the specific model appear to have unpredictable/unknown fit (perform better in predicting the unknown that the known feels indeed weird).
In order to overcome this, I would start experimenting by reconsidering the data splitting strategy, adding more data if possible, or even change your performance metric.

Is it possible to train Neural Network with low amount of instances?

I have faced some problem when I needed to solve Regression Task and use as minimum instances as possible. When I tried to use Xgboost I had to feed 4 instances to get the reasonable result. But Multilayer Perceptron tuned to overcoming Regression problems has to take 20 instances, tried to change amount of neurons&layers but the answer is still 20 .Is it possible to do something to make Neural Network solve Resgression tasks with from 2 to 4 instances? if yes - explain please what should I do to succeed in it? Maybe there is some correlation between how much instances are needed to train and get reasonable results from Perceptron and how features are valuable inside dataset?
Thanks in advance for any help
With small numbers of samples, there are likely better methods to apply, Xgaboost definitely comes to mind as a method that does quite well at avoiding overfitting.
Neural networks tend to work well with larger numbers of samples. They often over fit to small datasets and underperform other algorithms.
There is, however, an active area of research in semi-supervised techniques using neural networks with large datasets of unlabeled data and small datasets of labeled samples.
Here's a paper to start you down that path, search on 'semi supervised learning'.
http://vdel.me.cmu.edu/publications/2011cgev/paper.pdf
Another area of interest to reduce overfitting in smaller datasets is in multi-task learning.
http://ruder.io/multi-task/
Multi task learning requires the network to achieve multiple target goals for a given input. Adding more requirements tends to reduce the space of solutions that the network can converge on and often achieves better results because of it. To say that differently: when multiple objectives are defined, the parameters necessary to do well at one task are often beneficial for the other task and vice versa.
Lastly, another area of open research is GANs and how they might be used in semi-supervised learning. No papers pop to the forefront of my mind on the subject just now, so I'll leave this mention as a footnote.

Neural network gives different output for same input

What are the potential reasons for a NN to output different values for the same input? Especially when there isn't any random or stochastic processes?
This is a very broad and general question, might be even too broad to even be on here, but there are several things you should know about neural networks:
They are NOT methods for finding one prefect optimal solution. A neural network usually learn examples that it is given and "figures out" a way to predict results reasonably well. Reasonable is relative, and for some models may mean 50% success and for others anything short of 99.9% will be considered failure.
They're outcome is very dependent on the data that was trained on. The order of data matters, and it's usually a good idea to shuffle data during training, but that can lead to wildly different results. Also, the quality of data matters - if the training data is very different in nature to the test data for example.
The best analogy of neural networks in computing is of course - the brain. Even with the same information and same basic underlying biology, we could all evolve different opinions on matters based on endless other variables. Same thing with computer learning to some extent.
Some types of neural networks use dropout layers, that are specifically designed to shut off random parts of the network during training. This should not affect the final prediction process, because for predictions that layer is usually set to allow all the parts of the network to operate, but if you are inputting data and telling the model it is "training" instead of asking it to predict, the results may vary significantly.
The sum of all this is just to say: The training of neural networks should be expected to yield different results from similar starting conditions, and so must be tested multiple times for every condition to determine what parts of it are inevitable and what parts are not.
It might be due to shuffling of data , If you want to use the same vector you should turn the shuffle argument off.
You should try disabling dropout. Dropout randomly sets the outputs of certain neurons to 0. This will mean that your output will be different each time.

Must each tensorflow batch contain a uniform distribution of the inputs for all expected classifications?

This is probably a newbie question but I'm trying to get my head around how training on small batches works.
Scenario -
For the mnist classification problem, let's say that we have a model with appropriate hyerparameters that allow training on 0-9 digits. If we feed it with a small batches of uniform distribution of inputs (that have more or less same numbers of all digits in each batch), it'll learn to classify as expected.
Now, imagine that instead of a uniform distribution, we trained the model on images containing only 1s so that the weights are adjusted until it works perfectly for 1s. And then we start training on images that contain only 2s. Note that only the inputs have changed, the model and everything else has stayed the same.
Question -
What does the training exclusively on 2s after the model was already trained exclusively on 1s do? Will it keep adjusting the weights till it has forgotten (so to say) all about 1s and is now classifying on 2s? Or will it still adjust the weights in a way that it remembers both 1s and 2s?
In other words, must each batch contain a uniform distribution of different classifications? Does retraining a trained model in Tensorflow overwrite previous trainings? If yes, if it is not possible to create small (< 256) batches that are sufficiently uniform, does it make sense to train on very large (>= 500-2000) batch sizes?
That is a good question without a clear answer. In general, the order and selection of training samples has a large impact on the performance of the trained net, in particular in respect to the generalization properties it shows.
The impact is so strong, actually, that selecting specific examples, and ordering them in a particular way to maximize performance of the net even constitutes a genuine research area called `curriculum learning'. See this research paper.
So back to your specific question: You should try different possibilities and evaluate each of them (which might actually be an interesting learning exercise anyways). I would expect uniformly distributed samples to generalize well over different categories; samples drawn from the original distribution to achieve the highest overall score (since, if you have 90% samples from one category A, getting 70% over all categories will perform worse than having 99% from category A and 0% everywhere else, in terms of total accuracy); other sample selection mechanisms will show different behavior.
An interesting reading about such questions is Bengio's 2012 paper Practical Recommendations for Gradient-Based Training of Deep
Architectures
There is a section about online learning where the distribution of training data is unknown. I quote from the original paper
It
means that online learners, when given a stream of
non-repetitive training data, really optimize (maybe
not in the optimal way, i.e., using a first-order gradient
technique) what we really care about: generalization
error.
The best practice though to figure out how your dataset behaves under different testing scenarios would be to try them both and get experimental results of how the distribution of the training data affects your generalization error.