How can i understand my model overfit or underfit from root mean squared error?

How can i understand my model overfit or underfit from root mean squared error? - data-science

I know this interpretation with R square but, I have to do it with root mean squared error.
For example: For the training set i have root mean squared error for the 2nd degree 32.5, 3rd degree 29.2, 4th degree 27.5. On the other side, for the validation set i have root mean squared error for the 2nd degree 34.2, 3rd degree 32.3, 4th degree 35.8. I have some interpretation on it, i guess 4th degree is overfitting but i could not interpret anything for 2nd and 3rd degree.

The degree in your case represents model complexity. The model will generally perform better on the training set as you increase complexity -- the RMSE will decline. Also the performance on the validation set will generally increase as the model gets more complex because it will be able to capture the patterns in the data better. But this holds only until some point. When your model gets too complex (in your case the degree gets too high), it will copy the data too tightly and won't generalize -- its performance on the data it hadn't seen during training will suffer. The unseen data is your validation set. In your case, the step in complexity from 2nd degree to 3rd degree increased performance on both the training set and the validation set. However, when you tried 4th degree model, its performance on the validation set declined. That's the sign of overfitting.

Related

What’s the advantage of using LSTM for time series predict as opposed to Regression?

In neural networks, in general, which model should yield a better and accurate output between both for time series?

As you rightly mentioned, We can use linear regression with time series data as long as:
The inclusion of lagged terms as regressors does not create a collinearity problem.
Both the regressors and the explained variable are stationary.
Your errors are not correlated with each other.
The other linear regression assumptions apply.
No autocorrelation is the single most important assumption in linear regression. If autocorrelation is present the consequences are the following:
Bias: Your “best fit line” will likely be way off because it will be pulled away from the “true line” by the effect of the lagged errors.
Inconsistency: Given the above, your sample estimators are unlikely to converge to the population parameters.
Inefficiency: While it is theoretically possible, your residuals are unlikely to be homoskedastic if they are autocorrelated. Thus, your confidence intervals and your hypothesis tests will be unreliable.
While, The Long Short Term Memory neural network is a type of a Recurrent Neural Network (RNN). RNNs use previous time events to inform the later ones. For example, to classify what kind of event is happening in a movie, the model needs to use information about previous events. RNNs work well if the problem requires only recent information to perform the present task. If the problem requires long term dependencies, RNN would struggle to model it. The LSTM was designed to learn long term dependencies. It remembers the information for long periods.
To focus on the 1st sequence. The model takes the feature of the time bar at index 0 and it tries to predict the target of the time bar at index 1. Then it takes the feature of the time bar at index 1 and it tries to predict the target of the time bar at index 2, etc. The feature of 2nd sequence is shifted by 1 time bar from the feature of 1st sequence, the feature of 3rd sequence is shifted by 1 time bar from 2nd sequence, etc. With this procedure, we get many shorter sequences that are shifted by a single time bar.

Dealing with Error in Neural Network input

When you are building a neural network in which the input values are known to have error is there a way to incorporate this into the network? I.e one value of the input may have a known small error and so it's value is a good estimate; but another may have a larger standard error and so you are less confident in its true value.
Googling around this question is not easy because it's mostly Error Messages or error in the output that pops up so if someone here knows offhand that would be great thanks!

One possibility would be to use some inverse of the error as a weight during training. Basically when you are calculating the loss of one input example during training you multiply it by its weight to. A higher weight leads to a higher loss and a higher impact on the gradient and the change of the wheights.
By choosing for example 1 / standard error as the weight, a false estimation of an input with high uncertainty is not weighted as much as a certain example.

Reason why setting tensorflow's variable with small stddev

I have a question about a reason why setting TensorFlow's variable with small stddev.
I guess many people do test MNIST test code from TensorFlow beginner's guide.
As following it, the first layer's weights are initiated by using truncated_normal with stddev 0.1.
And I guessed if setting it with more bigger value, then it would be the same result, which is exactly accurate.
But although increasing epoch count, it doesn't work.
Is there anybody know this reason?
original :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=0.1), name='w_'+name)
#result : (990, 0.93000001, 0.89719999)
modified :
W_layer = tf.Variable(tf.truncated_normal([inp.get_shape()[1].value, size],stddev=200), name='w_'+name)
#result : (99990, 0.1, 0.098000005)

The reason is because you want to keep all the layer's variances (or standard deviations) approximately the same, and sane. It has to do with the error backpropagation step of the learning process and the activation functions used.
In order to learn the network's weights, the backpropagation step requires knowledge of the network's gradient, a measure of how strong each weight influences the input to reach the final output; layer's weight variance directly influences the propagation of gradients.
Say, for example, that the activation function is sigmoidal (e.g. tf.nn.sigmoid or tf.nn.tanh); this implies that all input values are squashed into a fixed output value range. For the sigmoid, it is the range 0..1, where essentially all values z greater or smaller than +/- 4 are very close to one (for z > 4) or zero (for z < -4) and only values within that range tend to have some meaningful "change".
Now the difference between the values sigmoid(5) and sigmoid(1000) is barely noticeable. Because of that, all very large or very small values will optimize very slowly, since their influence on the result y = sigmoid(W*x+b) is extremely small. Now the pre-activation value z = W*x+b (where x is the input) depends on the actual input x and the current weights W. If either of them is large, e.g. by initializing the weights with a high variance (i.e. standard deviation), the result will necessarily be (relatively) large, leading to said problem. This is also the reason why truncated_normal is used rather than a correct normal distribution: The latter only guarantees that most of the values are very close to the mean, with some less than 5% chance that this is not the case, while truncated_normal simply clips away every value that is too big or too small, guaranteeing that all weights are in the same range, while still being normally distributed.
To make matters worse, in a typical neural network - especially in deep learning - each network layer is followed by one or many others. If in each layer the output value range is big, the gradients will get bigger and bigger as well; this is known as the exploding gradients problem (a variation of the vanishing gradients, where gradients are getting smaller).
The reason that this is a problem is because learning starts at the very last layer and each weight is adjusted depending on how much it contributed to the error. If the gradients are indeed getting very big towards the end, the very last layer is the first one to pay a high toll for this: Its weights get adjusted very strongly - likely overcorrecting the actual problem - and then only the "remaining" error gets propagated further back, or up, the network. Here, since the last layer was already "fixed a lot" regarding the measured error, only smaller adjustments will be made. This may lead to the problem that the first layers are corrected only by a tiny bit or not at all, effectively preventing all learning there. The same basically happens if the learning rate is too big.
Finding the best weight initialization is a topic by itself and there are somewhat more sophisticated methods such as Xavier initialization or Layer-sequential unit variance, however small normally distributed values are usually simply a good guess.

How to speed up the rjags model training in Bayesian ranking?

All,
I am doing Bayesian modeling using rjags. However, when the number of observation is larger than 1000. The graph size is too big.
More specifically, I am doing a Bayesian ranking problem. Traditionally, one observation means one X[i, 1:N]-Y[i] pair, where X[i, 1:N] means the i-th item is represented by a N-size predictor vector, and Y[i] is a response. The objective is to minimize the point-wise error of predicted values,for example, least square error.
A ranking problem is different. Since we more care about the order, we use a pair-wise 1-0 indicator to represent the order between Y[i] and Y[j], for example, when Y[i]>Y[j], I(i,j)=1; otherwise I(i,j)=0. We treat this 1-0 indicator as an observation. Therefore, assuming we have K items: Y[1:K], the number of indicator is 0.5*K*(K-1). Hence when K is increased from 500 to 5000, the number of observations is very large, i.e. from 500^2 to 5000^2. The garph size of the rjags model is large too, for example graph size > 500,000. And the log-posterior will be very small.
And it takes a long time to complete the training. I think the consumed time is >40 hours. It is not practical for me to do further experiment. Therefore, do you have any idea to speed up the rjags. I heard that the RStan is faster than Rjags. Any one who has similar experience?

Identical Test set

I have some comments and i want to classify them as Positive or Negative.
So far i have an annotated dataset .
The thing is that the first 100 rows are classified as positive and the rest 100 as Negative.
I am using SQL Server Analysis-2008 R2. The Class attribute has 2 values, POS-for positive and NEG-for negative.
Also i use Naive Bayes algorithm with maximum input/output attributes=0 (want to use all the attributes) for the classification, the test set max case is set to 30%. The current score from the Lift Chart is 0.60.
Do i have to mix them up, for example 2 POS followed by 1 NEG, in order to get better classification accuracy?

The ordering of the learning instances should not affect classification performance. The probabilities computed by Naive Bayes will be the same for any ordering of instances in the data set.
However, the selection of different test and training sets can affect classification performance. For example, some instances might be inherently more difficult to classify than others.
Are you getting similarly poor training and test performance? If your training performance is good and/or much better than your test performance, your model may be over-fitted. Otherwise, if your training performance is also poor, I would suggest (a) trying a better/stronger/more expressive classifier, e.g., SVM, decision trees etc; and/or (b) making sure your features are representive/expressive enough of the data.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas