I have a 2-layer neural network that I'm training on about 10000 features (genomic data) with about 100 samples in my data set. I've noticed that every time I run my model (i.e. compile & fit) I get varying validation/testing accuracies even if I leave the train/test/validation split untouched. Sometimes it's around 70%, sometimes around 90%.
Due to the stochastic nature of the NN I anticipate some variation but could these strong fluctuations be a sign of something else?

The reason why you're seeing such a big instability with your validation accuracy is because your neural network is huge in comparison to the data you train it on.
Even with just 12 neurons per layer, you still have 12 * 10000 + 12 = 120012 parameters in your first layer. Now think about what the neural network does under the hood. It takes your 10000 inputs, it multiplies each input by some weight and then sums all these inputs. Now you provide it only 64 training examples on which the training algorithm is supposed to decide what are the correct input weights. Just based on intuition, from a purely combinatorial perspective there is going to be large amount of weight assignments that do well on your 64 training samples. And you have no guarantee that the training algorithm will pick such weight assignment that will also do well on your out-of-sample data.
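As a quick sanity check of the arithmetic above, here is a minimal sketch that counts the weights and biases of one dense layer. The helper name dense_param_count is made up for illustration; the layer width (12) and the 64 training examples are the numbers assumed in this answer, not values taken from your code.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


n_features = 10_000  # input dimension from the question
n_units = 12         # layer width assumed in this answer
n_train = 64         # number of training examples assumed in this answer

first_layer = dense_param_count(n_features, n_units)
params_per_sample = first_layer / n_train

print(first_layer)               # 120012
print(round(params_per_sample))  # 1875
```

So the first layer alone has on the order of a couple of thousand free parameters per training example.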
A neural network can represent a wide variety of functions (it has been proven that, under certain assumptions, it can approximate any function; this is called universal approximation). To select the function you want, you provide the training algorithm with data that constrains the space of all functions the network can represent to the subspace of functions that fit your data. However, such a function is in no way guaranteed to represent the true underlying relationship between the input and the output. Especially if the number of parameters is larger than the number of samples (in this case by a few orders of magnitude), you're nearly guaranteed to see your network simply memorize the samples in your training data, because it has the capacity to do so and you haven't constrained it enough.
In other words, what you're seeing is overfitting. In NNs, the general rule of thumb is that you want at least a couple of times more samples than you have parameters (look into the Hoeffding inequality for the theoretical rationale), and in effect the more samples you have, the less you need to worry about overfitting.
So here are a couple of possible solutions:
Use an algorithm that's more suitable for the case where you have high input dimension and low sample count, such as Kernel SVM (Support Vector Machine). With such a low sample count, it's quite possible that a Kernel SVM algorithm will achieve better and more consistent validation accuracy. (You can easily test this, they are available in the scikit-learn package, really easy to use)
If you insist on using NN - use regularization. Given the fact you already have working code, this will be easy, just add kernel_regularizer to all your layers, I would try both L1 and L2 regularization (probably separately). L1 regularization tends to push weights to zero so it might help reduce the number of parameters in your problem. L2 just tries to make all the weights small. Use your validation set to decide the best value for each regularization. You can optimize both for the best mean accuracy and also the lowest variance in accuracy on your validation data (do something like 20 training runs for each parameter value of L1 and L2 regularization, usually just trying different orders of magnitude is sufficient, e.g. 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1).
If most of your input features are not really predictive or if they are highly correlated, PCA (Principal Component Analysis) can be used to project your inputs into a much lower dimensional space (e.g. from 10000 to 20), where you'd have much smaller neural network (still I'd use L1 or L2 for regularization because even then you'd have more weights than training samples)
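The "20 training runs per regularization value" idea from the regularization suggestion can be sketched roughly like this. It is only a sketch: run_once is a hypothetical callback that you would write yourself (build the model with the given kernel_regularizer strength, fit it, and return the validation accuracy), and the tie-breaking rule in pick_best is just one possible choice.

```python
import statistics


def sweep_regularization(run_once, strengths, n_runs=20):
    """Return {strength: (mean, stdev)} of validation accuracy over repeated runs."""
    results = {}
    for strength in strengths:
        # run_once(strength, seed) is assumed to train once and return val accuracy
        scores = [run_once(strength, seed) for seed in range(n_runs)]
        results[strength] = (statistics.mean(scores), statistics.stdev(scores))
    return results


def pick_best(results):
    """Prefer the highest mean accuracy; break ties with the lowest spread."""
    return max(results, key=lambda s: (results[s][0], -results[s][1]))


# Orders of magnitude suggested in the answer
strengths = [1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1]
```

You would call it as results = sweep_regularization(run_once, strengths) once for L1 and once for L2, and look at both the mean and the spread before picking a value.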
On a final note, the point of a testing set is to use it very sparingly (ideally only once). It should be the final reported metric after all your research and model tuning is done. You should not optimize any values on it; do all of that on your validation set. To avoid overfitting on your validation set, look into k-fold cross-validation.
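For the k-fold part, here is a minimal pure-Python sketch of how the splits are formed. The function name k_fold_indices and the choice of 5 folds are assumptions for illustration; in practice scikit-learn ships ready-made splitters, so treat this only as a picture of the idea.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]     # k roughly equal folds
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx


# About 100 samples, as in the question; 5 folds is an arbitrary choice
splits = list(k_fold_indices(n_samples=100, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

You train and validate once per split and average the validation metric, while the test set stays untouched until the very end.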


Neural network immediately overfitting

I have a FFNN with 2 hidden layers for a regression task that overfits almost immediately (epoch 2-5, depending on # hidden units). (ReLU, Adam, MSE, same # hidden units per layer, tf.keras)
[Plots for 32 neurons and for 128 neurons not shown]
I will be tuning the number of hidden units, but to limit the search space I would like to know what the upper and lower bounds should be.
Afaik it is better to have a too large network and try to regularize via L2-reg or dropout than to lower the network's capacity -- because a larger network will have more local minima, but the actual loss value will be better.
Is there any point in trying to regularize (via e.g. dropout) a network that overfits from the get-go?
If so I suppose I could increase both bounds. If not I would lower them.
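from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
n_neurons = 32  # hidden units per layer (the value being tuned)

model = Sequential()
model.add(Dense(n_neurons, 'relu'))
model.add(Dense(n_neurons, 'relu'))
model.add(Dense(1, 'linear'))
model.compile('adam', 'mse')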
Hyperparameter tuning is generally the hardest step in ML. In general, we try different values randomly, evaluate the model, and choose the set of values that gives the best performance.
Getting back to your question: you have a high-variance problem (good in training, bad in testing).
There are eight things you can do, in order:
Make sure your test and training distribution are same.
Make sure you shuffle and then split the data into two sets (test and train)
A good train:test split will be 10^5:15K
Use a deeper network with Dropout/L2 regularization.
Increase your training set size.
Try Early Stopping
Change your loss function
Change the network architecture (Switch to ConvNets, LSTM etc).
Depending on your computation power and time you can set a bound to the number of hidden units and hidden layers you can have.
"because a larger network will have more local minima"
Nope, this is not quite true. In reality, as the number of dimensions (parameters) increases, the chance of getting stuck in a local minimum decreases, so we usually ignore the problem of local minima. For a point to be a local (or global) minimum, the derivative must be zero along every dimension of the weight space and the loss must also curve upwards along every one of them. In a typical high-dimensional model that is highly unlikely, and most zero-gradient points are saddle points instead.
One more thing: I noticed you are using a linear unit for the last layer. If your targets can never be negative, I suggest trying ReLU there instead; it may reduce test/train error. (If the targets can be negative, keep the linear output, since a ReLU output can never produce them.)
Take this:
In MSE, the per-sample term is 1/2 * (y_true - y_prediction)^2
With a linear output, y_prediction can be negative, and the whole MSE term may blow up to large values as y_prediction gets highly negative or highly positive.
Using a ReLU for the last layer makes sure that y_prediction is never negative, so for non-negative targets a lower error can be expected.
Let me try to substantiate some of the ideas here, referenced from Ian Goodfellow et al.'s Deep Learning book, which is available for free online:
Chapter 7: Regularization. The most important point is data: one can and should avoid regularization if they have large amounts of data that best approximate the distribution. In your case, it looks like there might be a significant discrepancy between training and test data. You need to ensure the data is consistent.
Section 7.4: Data augmentation. With regard to data, Goodfellow talks about data augmentation and inducing regularization by injecting noise (most likely Gaussian), which mathematically has the same effect. This noise works well with regression tasks, as it keeps the model from latching onto a single feature to overfit.
Section 7.8: Early stopping. This is useful if you just want a model with the best test error. But again, it only works if your data allows the training to infer the test data. If there is an immediate increase in test error, the training would stop immediately.
Section 7.12: Dropout. Just applying dropout to a regression model doesn't necessarily help. In fact "when extremely few labeled training examples are available, dropout is less effective". For classification, dropout forces the model to not rely on single features, but in regression all inputs might be required to compute a value rather than classify.
Chapter 11: Practical Methodology. This chapter emphasises the use of baseline models to ensure that the training task is not trivial. If a simple linear regression can achieve similar behaviour, then you don't even have a training problem to begin with.
Bottom line is you can't just play with the model and hope for the best. Check the data, understand what is required and then apply the corresponding techniques. For more details read the book, it's very good. Your starting point should be a simple regression model, 1 layer, very few neurons and see what happens. Then incrementally experiment.
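To make the "start from a simple baseline" advice concrete, here is a minimal sketch of a linear least-squares baseline with NumPy. Everything about the data is an assumption: the arrays below are synthetic stand-ins and should be replaced with your own train/validation split. The helper names fit_linear_baseline and mse are made up for this example.

```python
import numpy as np


def add_bias(X):
    """Append a column of ones so the model can learn an intercept."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear_baseline(X, y):
    """Ordinary least squares; returns weights with the bias as last entry."""
    w, *_ = np.linalg.lstsq(add_bias(X), y, rcond=None)
    return w


def mse(X, y, w):
    return float(np.mean((add_bias(X) @ w - y) ** 2))


# Synthetic stand-in data (assumption): swap in your own arrays here.
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
X_train, X_val = rng.normal(size=(200, 5)), rng.normal(size=(50, 5))
y_train = X_train @ true_w + 0.1 * rng.normal(size=200)
y_val = X_val @ true_w + 0.1 * rng.normal(size=50)

w = fit_linear_baseline(X_train, y_train)
baseline_val_mse = mse(X_val, y_val, w)
print(baseline_val_mse)
```

If your network's validation MSE is not clearly below this baseline number on your real data, the extra capacity is probably not buying you anything yet.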

Cost function convergence in Tensorflow using softmax_cross_entropy_with_logits and "soft" labels/targets

I've found what is probably a rare case in Tensorflow, but I'm trying to train a classifier (linear or nonlinear) using KL divergence (cross entropy) as the cost function in Tensorflow, with soft targets/labels (labels that form a valid probability distribution but are not "hard" 1 or 0).
However, it is clear (telltale signs) that something is definitely wrong. I've tried both linear and nonlinear (dense neural network) forms, but no matter what I always get the same final value for my loss function regardless of network architecture (even if I train only a bias). Also, the cost function converges extremely quickly (within like 20-30 iterations) using L-BFGS (a very reliable optimizer!). Another sign something is amiss is that I can't overfit the data, and the validation set appears to have exactly the same loss value as the training set. However, strangely I do see some improvements when I increase network architecture size and/or change regularization loss. The accuracy improves with this as well (although not to the point that I'm happy with it or it's as I expect).
It DOES work as expected when I use the exact same code but send in one-hot encoded labels (not soft targets). An example of the cost function from training taken from Tensorboard is shown below. Can someone pitch me some ideas?
Ahh my friend, your problem is that with soft targets, especially ones that aren't close to 1 or 0, cross entropy loss doesn't change significantly as the algorithm improves. One thing that will help you understand this problem is to take an example from your training data and compute the entropy; then you will know the lowest value your cost function can reach. This may shed some light on your problem. So for one of your examples, let's say the targets are [0.39019628, 0.44301641, 0.16678731]. Well, using the formula for cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
but then using the targets "y_" in place of the predicted probabilities "y", we arrive at the true entropy value of 1.0266190072458234. If your predictions are just slightly off target, let's say they are [0.39511779, 0.44509024, 0.15979198], then the cross entropy is 1.026805558049737.
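If you want to reproduce those two numbers without TensorFlow, here is a small plain-Python sketch for a single example. The function name cross_entropy is just a local helper; it uses the natural log, like tf.log.

```python
import math


def cross_entropy(targets, predictions):
    """H(p, q) = -sum(p_i * ln(q_i)) for one example."""
    return -sum(p * math.log(q) for p, q in zip(targets, predictions))


targets = [0.39019628, 0.44301641, 0.16678731]
predictions = [0.39511779, 0.44509024, 0.15979198]

entropy = cross_entropy(targets, targets)   # floor: the loss of a perfect prediction
loss = cross_entropy(targets, predictions)  # loss of a slightly-off prediction
gap = loss - entropy                        # this difference is the KL divergence

print(entropy, loss, gap)
```

The gap between a slightly-off prediction and a perfect one is only about 2e-4, on top of a floor of roughly 1.0266, which is why the loss curve looks flat.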
Now, as with most difficult problems, it's not just one thing but a combination of things. The loss function is being implemented correctly, but you made the "mistake" of doing what you should do in 99.9% of cases when training deep learning algorithms: you used 32-bit floats. In this particular case though, you will run out of significant digits that a 32-bit float can represent well before your training algorithm converges to a nice result. If I use your exact same data and code but only change the data types to 64-bit floats, you can see below that the results are much better: your algorithm continues to train well out past 2000 iterations, and you will see it reflected in your accuracy as well. In fact, you can see from the slope that if 128-bit floating point were supported, you could continue training and probably see advantages from it. You probably wouldn't need that precision in your final weights and biases, just during training to support continuing optimization of the cost function.
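Here is a rough NumPy illustration of the precision argument. The loss value is the entropy figure quoted in this answer; the size of the "improvement" (1e-8) is a made-up number chosen only to show the effect, not something measured from your training run.

```python
import numpy as np

loss = 1.0266190072458234  # entropy value quoted in this answer
improvement = 1e-8         # hypothetical tiny decrease in the loss

# In float32 the decrease is smaller than half the gap between neighbouring
# representable values, so the subtraction rounds back to the same number.
lost_in_float32 = bool(
    np.float32(loss) - np.float32(improvement) == np.float32(loss)
)

# In float64 the same decrease is still visible.
kept_in_float64 = bool(
    np.float64(loss) - np.float64(improvement) != np.float64(loss)
)

spacing32 = float(np.spacing(np.float32(loss)))  # gap to the next float32
print(lost_in_float32, kept_in_float64, spacing32)
```

Near 1.03 the float32 grid has steps of about 1.2e-7, so any per-step change in the loss much below that is invisible to the optimizer's convergence check.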

In Stochastic Gradient Descent, as the cost function is updated based on a single training example, won't it lead to overfitting?

When we are dealing with Stochastic Gradient Descent, the cost function is updated based on single, random training data.
But this single entry may alter the weights to its favour and as the cost function is only dependent on that entry, the cost function might mislead us, as it isn't actually reducing the cost, but instead it is overfitting the particular entry. With the next entry, once again, the weights will be updated to favour this entry.
Won't it lead to over fitting? How do I go about resolving this issue?
The training data isn't random - SGD iterates over all the training points (either singly or in batches). Because the loss function is calculated for a data batch (or an individual training point), the resulting gradient can be thought of as a random draw from a distribution of gradient vectors in weight space that will not exactly match the global gradient of the loss function calculated over the entirety of the training data. A single step is absolutely "over-fit" to the batch / training point, but we only take a single step in that direction (moderated by the learning rate, which is typically << 1). Then we move on to the next data point (or batch) and calculate a new gradient. There is a "recency" effect (data trained on more recently effectively counts more), but this is moderated by small learning rates. In aggregate over many iterations, all of the training data are equally weighted.
By doing this over all of the data in turn, each individual backprop step is taking a small random (but not uncorrelated) step in weight space. Across many training iterations, the network may be able to find its way to very good solutions (not a lot of guarantees about global optimality, but neural networks are highly expressive by their nature and can often find very good solutions). However, it may take many stepwise iterations over the same data set to converge to a local basin of attraction.
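A small NumPy sketch of that picture, using a linear model with squared error as a stand-in for a network (an assumption made purely to keep the gradient formula short; the data is random and synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 synthetic training points, 3 features
y = rng.normal(size=8)
w = np.zeros(3)


def grad_half_mse(w, X, y):
    """Gradient of mean((X @ w - y)**2) / 2 with respect to w."""
    return X.T @ (X @ w - y) / len(y)


full_grad = grad_half_mse(w, X, y)
per_sample = np.array(
    [grad_half_mse(w, X[i:i + 1], y[i:i + 1]) for i in range(len(y))]
)

# Each per-sample gradient is a noisy draw; their mean is the full gradient.
assert np.allclose(per_sample.mean(axis=0), full_grad)

# One SGD step on a single point only moves w a little for a small learning rate.
learning_rate = 0.01
w_after = w - learning_rate * per_sample[0]
print(np.linalg.norm(w_after - w))
```

Individual steps point in different directions, but because their average is the true gradient, many small steps drift towards lower loss on the whole training set.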
Over-fitting on training data is absolutely a concern for Neural Networks, but that's a function of their expressivity rather than the Stochastic Gradient Descent algorithm. Techniques like dropout and kernel regularizers on the training weights can provide regularization robustness, but the only way to

Tensorflow: Increasing number of duplicate predictions while training

I have a multilayer perceptron with 5 hidden layers and 256 neurons each. When I start training, I get different prediction probabilities for each train sample until epoch 50, but then the number of duplicate predictions increases, on epoch 300 I already have 30% of duplicate predictions which does not make sense since the input data is different for all training samples. Any idea what causes this behavior?
with "duplicate predictions", I mean items with the exactly same predicted probability to belong to class A (it's a binary classification problem)
I have 4000 training samples with 200 features each and all samples are different, it does not make sense that the number of duplicate predictions increases to 30% while training. So I wonder what can cause this behavior.
One point, you say you are doing a binary prediction, and when you say "duplicate predictions", even with your clarification it's hard to understand your meaning. I am guessing that you have two outputs for your binary classifier, one for class A and one for class B and you are getting roughly the same value for a given sample. If that's the case, then the first thing to do is to use 1 output. A binary classification problem is better modeled with 1 output that ranges between 0 and 1 (sigmoid the output neuron). This way there will be no ambiguity, the network will have to choose one or the other, or when it's confused you'll get ~0.5 and it will be clear.
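As a side note on the one-output suggestion: assuming the two-output version ends in a softmax (I'm guessing here, since the question doesn't show the model), a single sigmoid output carries exactly the same information. A tiny sketch with made-up logits:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(a, b):
    """Probabilities for a two-output head with logits a (class A) and b (class B)."""
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)


logit_a, logit_b = 1.3, -0.4  # arbitrary example logits
p_a_two_outputs, p_b_two_outputs = softmax2(logit_a, logit_b)
p_a_one_output = sigmoid(logit_a - logit_b)

print(p_a_two_outputs, p_a_one_output)
```

The two printed probabilities agree up to floating-point rounding, so nothing is lost by switching to one sigmoid unit, and a confused network shows up plainly as a value near 0.5.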
Second, it is very common for a network to start learning well and then to perform more poorly after overtraining, especially with small datasets such as yours. In fact, even with the little knowledge I have of your dataset, I would put a small bet on you getting better performance out of an algorithm like XGBoost than a neural network (I assume you're using a neural net and not literally a perceptron).
But regarding the performance degrading over time: when this happens you want to look into something called "early stopping". At some point the network will start memorizing the input, and this may be part of what's happening. Essentially you train until the performance on your held-out (validation) data starts to worsen.
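The stopping rule itself is simple. Here is a minimal sketch with a patience counter; the function name, the patience value and the list of validation losses are all made up for illustration (Keras has a built-in EarlyStopping callback that you would normally use instead).

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the best epoch, stopping once `patience`
    epochs have passed without a new best validation loss."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Made-up losses: improving at first, then degrading as memorization sets in.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
best = early_stopping_epoch(val_losses, patience=3)
print(best)  # 2
```

You would keep (or restore) the weights from the returned epoch rather than the ones from the last epoch trained.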
To address this you can apply various forms of regularization (L2 regularization, dropout, batch normalization all come to mind). You can also reduce the size of your network. 5 layers of 256 neurons sounds too big for the problem. Try trimming this down and I bet your results will improve. There is a sweet spot for architecture size in neural networks. When your network is too large it can, and often will, overfit. When it's too small it won't be expressive enough for the data. Andrew Ng's Coursera class has some helpful practical advice on dealing with this.

Multi GPU architecture, gradient averaging - less accurate model?

When I execute the cifar10 model as described at https://www.tensorflow.org/tutorials/deep_cnn I achieve 86% accuracy after approx. 4 hours using a single GPU. When I use 2 GPUs the accuracy drops to 84%, but reaching 84% accuracy is faster on 2 GPUs than on 1.
My intuition is that the average_gradients function as defined at https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient value, as an average of gradients will be less accurate than the actual gradient value.
If the gradients are less accurate, then the parameters that control the function learned during training are less accurate too. Looking at the code (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py), why is averaging the gradients over multiple GPUs less accurate than computing the gradient on a single GPU?
Is my intuition of averaging the gradients producing a less accurate value correct ?
Randomness in the model is described as :
The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:
Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.
src : https://www.tensorflow.org/tutorials/deep_cnn
Does this have an effect on training accuracy ?
Update :
Attempting to investigate this further, the loss function value training with different number of GPU's.
Training with 1 GPU : loss value : .7 , Accuracy : 86%
Training with 2 GPU's : loss value : .5 , Accuracy : 84%
Shouldn't the loss value be lower for higher accuracy, not vice versa?
In the code you linked, using the function average_gradients with 2 GPUs is exactly equivalent (1) to simply using 1 GPU with twice the batch size.
You can see it in the definition:
grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)
Using a larger batch size (given the same number of epochs) can change your results in either direction, since each epoch then performs half as many parameter updates.
Therefore, if you want to do exactly equivalent (1) calculations in 1-GPU or 2-GPU cases, you may want to halve the batch size in the latter case. (People sometimes avoid doing it, because smaller batch sizes may also make the computation on each GPU slower, in some cases)
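The equivalence is easy to check numerically. The sketch below uses a linear model with mean squared error and synthetic data as a stand-in for the CIFAR-10 network (both are assumptions, chosen only so the gradient fits on one line); "GPU 0" and "GPU 1" are just two halves of one array.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 4))  # one "double-size" batch of synthetic data
y = rng.normal(size=128)
w = rng.normal(size=4)


def grad_mse(w, X, y):
    """Gradient of the mean squared error over the given batch."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


# Each simulated GPU sees half of the data (batch size 64 each).
g0 = grad_mse(w, X[:64], y[:64])
g1 = grad_mse(w, X[64:], y[64:])
averaged = np.mean(np.stack([g0, g1]), axis=0)  # same idea as reduce_mean over towers

single_gpu_double_batch = grad_mse(w, X, y)     # one device, batch size 128

print(np.allclose(averaged, single_gpu_double_batch))  # True
```

So the averaged gradient is not "less accurate"; it is the gradient of a batch twice as large, which changes the training dynamics rather than the correctness.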
Additionally, one needs to be careful with learning rate decay here. If you use it, you want to make sure the learning rate is the same in the nth epoch in both 1-GPU and 2-GPU cases -- I'm not entirely sure this code is doing the right thing here. I tend to print the learning rate in the logs, something like
print(sess.run(lr))
should work here.
(1) Ignoring issues related to pseudo-random numbers, finite precision or data set sizes not divisible by the batch size.
There is a decent discussion of this here (not my content). Basically when you distribute SGD, you have to communicate gradients back and forth somehow between workers. This is inherently imperfect, and so your distributed SGD typically diverges from a sequential, single-worker SGD at least to some degree. It is also typically faster, so there is a trade off.
[Zhang et al., 2015] proposes one method for distributed SGD called elastic averaging SGD. The paper goes through a stability analysis characterizing the behavior of the gradients under different communication constraints. It gets a little heavy, but it might shed some light on why you see this behavior.
Edit: regarding whether the loss should be lower for the higher accuracy, it is going to depend on a couple of things. First, I am assuming that you are using softmax cross-entropy for your loss (as stated in the deep_cnn tutorial you linked), and assuming accuracy is the total number of correct predictions divided by the total number of samples. In this case, a lower loss on the same dataset should correlate with a higher accuracy. The "same dataset" qualifier is important.
If you are reporting loss during training but then report accuracy on your validation (or testing) dataset, it is possible for these two to be only loosely correlated. This is because the model is fitting (minimizing loss) to a certain subset of your total samples throughout the training process, and is then tested against new samples that it has never seen before to verify that it generalizes well. The loss against this testing/validation set could be (and probably is) higher than the loss against the training set, so if the two numbers are being reported from different sets, you may not be able to draw comparisons like "loss for the 1 GPU case should be lower since its accuracy is higher".
Second, if you are distributing the training then you are calculating losses across multiple workers (I believe), but only one accuracy at the end, again against a testing or validation set. Maybe the loss being reported is the best loss seen by any one worker, but overall the average losses were higher.
Basically I do not think we have enough information to decisively say why the loss and accuracy do not seem to correlate the way you expect, but there are a number of ways this could be happening, so I wouldn't dismiss it out of hand.
I've also encountered this issue.
See Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour from Facebook which addresses the same issue. The suggested solution is simply to scale up the learning rate by k (after some reasonable warm-up epochs) for k GPUs.
In practice I've found out that simply summing up the gradients from the GPUs (rather than averaging them) and using the original learning rate sometimes does the job as well.
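The two tricks are actually the same update for plain SGD, which a few lines of NumPy can show. The gradients here are random stand-ins and base_lr is a hypothetical value; note the identity is only about the vanilla SGD step, and I would not assume it carries over unchanged to adaptive optimizers such as Adam.

```python
import numpy as np

k = 2          # number of GPUs
base_lr = 0.1  # hypothetical single-GPU learning rate

rng = np.random.default_rng(0)
tower_grads = rng.normal(size=(k, 5))  # stand-in gradients, one row per GPU

# Option A: sum the per-GPU gradients, keep the original learning rate.
step_sum = base_lr * tower_grads.sum(axis=0)

# Option B: average the per-GPU gradients, scale the learning rate by k.
step_avg_scaled = (k * base_lr) * tower_grads.mean(axis=0)

print(np.allclose(step_sum, step_avg_scaled))  # True
```

What the paper adds on top of this is the warm-up period, which the plain summing trick does not give you.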