Keras NN producing results with good variability prediction but poor magnitude prediction - tensorflow

I'm currently using keras (tensorflow) to create a feed-forward neural network to predict a sales value. When I look at the test set comparing predicted vs real sales values a fit line ends up having a good R squared value but a low slope value. So it's predicting the variability in the data correctly, but not predicting the magnitude of the data. When I look at the data in smaller subsets, the underprediction is consistent. Has anybody had experience with this or have an idea what could cause this? I have a feeling it may be a data bias/normalization issue.


Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I have a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from layers or weights of the network in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper, eq. 8, they refer to M as the number of mini-batches. To be consistent with non-stochastic gradient learning, the KL term should be scaled by the number of mini-batches, which is what is done by Graves. An alternative is the one in eq. 9, where they scale it by \pi_i, where the values in the set {\pi} sum to one.
In the TFP example, num_examples does look like the total number of independent samples in the training set, which is much larger than the number of batches. This kind of rescaling goes by a few names, such as Safe Bayes or tempering. Have a look at sec. 8 of this paper for more discussion of the use of tempering within Bayesian inference and its suitability.
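A small arithmetic sketch of how the two coefficients relate. It rests on an assumption that is not stated above: that the likelihood term is averaged over the mini-batch when the KL is divided by the number of examples, and summed over the mini-batch when the KL is divided by the number of batches. All numbers are hypothetical and chosen only to make the arithmetic visible.

```python
# Hypothetical sizes: N examples, B per mini-batch, M = N / B mini-batches.
num_examples = 60000
batch_size = 100
num_batches = num_examples // batch_size

kl = 1200.0                      # KL(q || prior), identical for every batch
batch_nll = [0.9] * batch_size   # per-example negative log-likelihoods

# KL scaled by 1/M, with the NLL summed over the mini-batch.
loss_sum = kl / num_batches + sum(batch_nll)

# KL scaled by 1/N, with the NLL averaged over the mini-batch (assumption).
loss_mean = kl / num_examples + sum(batch_nll) / batch_size

# Under this assumption the two losses differ only by the constant factor B.
ratio = loss_sum / loss_mean
print(ratio)
```

If the likelihood were summed rather than averaged, dividing the KL by the number of examples would instead down-weight it relative to the 1/M scaling, which is the tempering reading given above.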
As I go from 2D inputs to 3D image volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO loss will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found (assuming a full mean-field approach, where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (each parameter is assumed independent), you can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of the KL terms for each parameter. Since the KL is >= 0, every parameter you add to your model adds another non-negative term to your ELBO loss. This is likely why your loss is so much larger for your 3D model: there are more parameters.
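The sum over parameters can be sketched numerically. The snippet below assumes a Gaussian approximate posterior per weight and a standard normal prior, which is a simplification chosen for its closed form and not the prior used in the paper. The function names and the values of mu and sigma are hypothetical.

```python
import math

def kl_gaussian_vs_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single weight, in nats."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0) - math.log(sigma)

def mean_field_kl(mus, sigmas):
    """Total KL of a factorised posterior: the sum of the per-weight KLs."""
    return sum(kl_gaussian_vs_standard_normal(m, s) for m, s in zip(mus, sigmas))

# Same per-weight posterior, different number of weights.
small_model_kl = mean_field_kl([0.1] * 1_000, [0.5] * 1_000)
large_model_kl = mean_field_kl([0.1] * 100_000, [0.5] * 100_000)

print(small_model_kl, large_model_kl)
```

With identical per-weight posteriors, the total KL grows linearly with the number of weights, so a model with 100 times more parameters carries a KL term about 100 times larger.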
Another reason this could occur is if you have less data: your M is smaller, so the KL term is scaled down less.
What is the guideline for getting a proper value for this coefficient? Maybe like the two-loss terms should be the same order of magnitude?
I am unsure of any specific guideline. For training, you are primarily interested in the gradients, and a large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and by the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky to the Bayesian in me).
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss without using keras.model.losses, because of some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0. The issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from layers or weights of the network in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) Batchnorm is fine. For sampling methods such as MCMC, Batch normalization is no longer suitable. Have a look at for info on suitability for batch norm with sampling methods for approx. Bayesian inference.

Once a CNN is trained, should its outputs be deterministic?

I just trained a CNN with Tensorflow/Keras and saved it as a model. I tried running about 1000 inputs through it multiple times, and each time got a slightly different prediction accuracy. The accuracy was good, and I am not concerned with the performance; however, I thought that CNN models, once trained, should be deterministic. That is, any input will always be classified the same way. Is this not the case? Is there variability in the way a model can predict once trained? If not, hopefully I can assume that I have programmed some variability into my code unawares. Any help would be appreciated.
Once a CNN is trained, should its outputs be deterministic?
Well, in theory, yes. In practice, as Peter Duniho points out in his excellent explanatory comment, we can see very small deviations because of the way values are calculated, aggregated, etc.
However, the probability of such small deviations changing the predicted category (and therefore the accuracy) of a classification model is so small that I'd be almost certain something else is at play in your example, even over a sample size of 1000.
Have you left on some training regularisation like batch normalisation? Are you certain you are evaluating precisely the same 1000 inputs each time? Got to suspect the issue is in the code rather than rounding errors.
Can you determine which specific classification changes?
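A minimal sketch of both points, in plain Python. The first part shows the kind of tiny deviation that comes from the order in which floating-point values are added. The second is a hypothetical helper, changed_indices, for listing which inputs were classified differently between two runs; the label lists are made up.

```python
# Floating-point addition is not associative, so the order of aggregation
# can change the last bits of a result.
forward = (0.1 + 0.2) + 0.3
backward = 0.1 + (0.2 + 0.3)
print(forward == backward, abs(forward - backward))

def changed_indices(preds_a, preds_b):
    """Return the positions where two runs predicted different classes."""
    if len(preds_a) != len(preds_b):
        raise ValueError("both runs must cover the same inputs")
    return [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]

# Hypothetical predicted labels from two runs over the same five inputs.
run_1 = [3, 1, 4, 1, 5]
run_2 = [3, 1, 4, 2, 5]
print(changed_indices(run_1, run_2))
```

If the list of changed indices is long, or changes on inputs the model classifies with high confidence, rounding is an unlikely explanation and the evaluation code is the better place to look.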

Tensorflow bounded regression vs classification

As part of my masters thesis I have been tasked with predicting a label integer (0-255) which is a binned representation of an angle. The feature columns are also integers, in the range (0-255).
So far I have used the custom Tensorflow layers estimator, implementing a 256 output classifier which performs well. However, my issue with the classification approach I am using is the following:
My classification model thinks that predicting a 3 instead of a 28 is as good/bad as predicting a 27 as a 28
The numerical interval / ordinal nature of my data (not sure which) leads me to believe that if I used regression I would achieve results with less drastically incorrect predictions or outliers.
My goal:
to reduce the number of drastically incorrect predicted outliers
My questions:
Is regression the better approach, or can I improve my classification to include an ordinal/interval relationship between my labels?
If I choose regression, is there a way to bound my predicted output between 0-255 (I know I will have to round float values predicted).
Thanks in advance. Any other comments, suggestions or ideas to help me to best tackle the problem are also very helpful.
If I made any incorrect assumptions or mistake in my interpretation of the problem feel free to correct me.
Question 1: Regression is the simpler approach, however, you can also use classification and manipulate the loss function to have a lower loss for misclassifications that are "close" to the original class.
Question 2: The tensorflow command for bounding your prediction is tf.clip_by_value. Are you mapping all 360 degrees to [0,255]? In that case you will want to consider the boundary cases, e.g. your estimator yields -4 and the true value is 252: they actually represent the same angle, so the loss should be 0.
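A minimal sketch of a wrap-around distance, assuming the 256 bins cover the full circle so that bin 255 is adjacent to bin 0. The helper names are hypothetical, and this is plain Python for illustration, not a TensorFlow loss.

```python
NUM_BINS = 256  # assumption: bins 0..255 span all 360 degrees

def circular_bin_distance(pred, target, num_bins=NUM_BINS):
    """Shortest distance between two bins when the bin axis wraps around."""
    diff = abs(pred - target) % num_bins
    return min(diff, num_bins - diff)

def wrap_to_bin(value, num_bins=NUM_BINS):
    """Round a raw regression output and wrap it into 0..num_bins-1."""
    return int(round(value)) % num_bins

print(circular_bin_distance(-4, 252))  # same angle
print(circular_bin_distance(255, 0))   # neighbouring bins across the boundary
print(circular_bin_distance(3, 28))    # far apart, no wrap-around involved
print(wrap_to_bin(-4.2))
```

Under this assumption, wrapping with a modulo keeps the prediction on the circle, whereas clipping would push an output of -4 to bin 0. The same distance could also serve as the "closeness" measure when re-weighting a classification loss, as suggested for Question 1.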

Is it possible to extract confidence values for regression predictions in tensorflow?

Can I extract the confidence values or variance in prediction error from a tensorflow regressor? e.g. if the model gives a prediction x, then can I know the confidence band, like is x in +-25% range of the actual value?
I'm afraid it's not as easy as when using softmax in the output layer. As said here, you can use the MSE of the NN on the validation set as an estimate of the variance, then build the interval for your desired confidence level. Be aware that this approach assumes a lot of things (e.g. that the distribution of errors is always the same, which may not be true), so if you really need those confidence intervals, a regression NN is not the best fit for you.
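A rough sketch of that recipe, under loud assumptions: the validation errors are treated as zero-mean Gaussian with the same variance for every input, and z = 1.96 is used for an approximate 95% band. The function name and the numbers are hypothetical.

```python
import math

def prediction_interval(prediction, validation_errors, z=1.96):
    """Band around a prediction, using validation MSE as the error variance.

    Assumes zero-mean Gaussian errors with constant variance (a strong
    assumption); z = 1.96 corresponds to roughly 95% coverage.
    """
    mse = sum(e * e for e in validation_errors) / len(validation_errors)
    half_width = z * math.sqrt(mse)
    return prediction - half_width, prediction + half_width

# Hypothetical residuals (prediction minus truth) on a validation set.
errors = [1.0, -1.0, 1.0, -1.0]
low, high = prediction_interval(10.0, errors)
print(low, high)
```

Note that this gives a band of fixed absolute width, not a relative one such as +-25%, and the width does not adapt to how difficult a particular input is.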

Tensorflow: Increasing number of duplicate predictions while training

I have a multilayer perceptron with 5 hidden layers of 256 neurons each. When I start training, I get different prediction probabilities for each training sample until epoch 50, but then the number of duplicate predictions increases. By epoch 300 I already have 30% duplicate predictions, which does not make sense since the input data is different for all training samples. Any idea what causes this behavior?
By "duplicate predictions", I mean items with exactly the same predicted probability of belonging to class A (it's a binary classification problem).
I have 4000 training samples with 200 features each and all samples are different, it does not make sense that the number of duplicate predictions increases to 30% while training. So I wonder what can cause this behavior.
One point, you say you are doing a binary prediction, and when you say "duplicate predictions", even with your clarification it's hard to understand your meaning. I am guessing that you have two outputs for your binary classifier, one for class A and one for class B and you are getting roughly the same value for a given sample. If that's the case, then the first thing to do is to use 1 output. A binary classification problem is better modeled with 1 output that ranges between 0 and 1 (sigmoid the output neuron). This way there will be no ambiguity, the network will have to choose one or the other, or when it's confused you'll get ~0.5 and it will be clear.
Second, it is very common for a network to start learning well and then to perform more poorly after overtraining, especially with small datasets such as yours. In fact, even with the little knowledge I have of your dataset, I would put a small bet on you getting better performance out of an algorithm like XGBoost than a neural network (I assume you're using a neural net and not literally a perceptron).
Regarding the performance degrading over time: when this happens, you want to look into something called "early stopping". At some point the network will start memorizing the input, which may be part of what's happening. Essentially, you train until the performance on your held-out data starts to worsen.
To address this you can apply various forms of regularization (L2 regularization, dropout, and batch normalization all come to mind). You can also reduce the size of your network. 5 layers of 256 neurons sounds too big for the problem. Try trimming this down and I bet your results will improve. There is a sweet spot for architecture size in neural networks. When your network is too large it can, and often will, overfit. When it's too small it won't be expressive enough for the data. Andrew Ng's Coursera class has some helpful practical advice on dealing with this.
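The early-stopping rule mentioned above can be sketched in a few lines of plain Python. The helper name, the patience value, and the loss curve are hypothetical. In Keras the built-in counterpart is tf.keras.callbacks.EarlyStopping, with its monitor, patience, and restore_best_weights arguments.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the epoch with the best held-out loss under a patience rule.

    Scanning stops once `patience` epochs have passed without the held-out
    loss improving on the best value by more than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# Hypothetical held-out loss per epoch: improves, then starts to worsen.
history = [1.0, 0.8, 0.7, 0.72, 0.75, 0.8, 0.6]
print(early_stopping_epoch(history, patience=3))
```

The patience value trades off stopping too early on a noisy curve against training well into the overfitting regime. In the made-up curve above, a patience of 3 stops before the late improvement at the last epoch is ever seen.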