How to calculate Normalised Mean Square Error (NMSE) and why to use it?

I've been told I need to normalise my MSE for my thesis involving neural networks.
Equations for NMSE seem to be few and far between. I have the following and want to corroborate it if possible:
Is the standard deviation term supposed to be calculated from the target values or the predicted values?
Also, what are the main advantages of using NMSE over MSE? Is it just that it makes error comparisons easier because of the simpler scale?
Many thanks for any help!

def nmser(x, y):
    # sum the squared errors, each divided by the corresponding target value,
    # then take the mean
    z = 0
    if len(x) == len(y):
        for k in range(len(x)):
            z = z + ((x[k] - y[k]) ** 2) / x[k]
        z = z / len(x)
    return z
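For what it's worth, one common convention (I can't say whether it is the one your supervisor has in mind) is to normalise the MSE by the variance of the target values, so the standard deviation term comes from the targets rather than the predictions. A minimal sketch under that assumption:

import numpy as np

def nmse(targets, predictions):
    # one common convention: MSE divided by the variance of the targets,
    # so an NMSE of 1.0 means "no better than always predicting the mean"
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mse = np.mean((targets - predictions) ** 2)
    return mse / np.var(targets)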

Related

Calculating covariance of a 3-dim matrix using einsum

I've got an array of time series data of shape (2466, 2498, 9), i.e. (asset, date, feature).
I've got 9 features, on which I want to do PCA to reduce the dimensionality on this axis.
I'm struggling to calculate the covariance matrix, Z = X.T @ X.
I think I want to express this as an einsum, but I'm not sure how. I'm certainly interested in other methods as well, as the purpose of this is to learn numpy, rather than actually solve a problem.
Edit: This is my (apparently wrong) attempt so far:
np.einsum('ijk,ijl->ijkl', myData, myData)
(This just hangs my system.)
Edit 2:
I've come to understand that I should be using np.linalg.svd for this problem.
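For reference, a sketch of one way to get the 9x9 feature covariance with einsum, treating every (asset, date) pair as a sample; the array here is a made-up, smaller stand-in for myData:

import numpy as np

# hypothetical smaller array with the same (asset, date, feature) layout
myData = np.random.randn(100, 50, 9)

# centre each feature, then sum outer products over assets and dates;
# summing over both sample axes keeps the output at 9x9 instead of the
# enormous (asset, date, 9, 9) array that 'ijk,ijl->ijkl' builds
centred = myData - myData.mean(axis=(0, 1))
n_samples = myData.shape[0] * myData.shape[1]
cov = np.einsum('adi,adj->ij', centred, centred) / (n_samples - 1)

# sanity check against np.cov on the flattened (samples, features) view
assert np.allclose(cov, np.cov(centred.reshape(-1, 9), rowvar=False))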

Tensorflow Linear Regression: Getting values for Adjusted R Square, Coefficients, P-value

There are a few key parameters associated with linear regression, e.g. adjusted R square, coefficients, p-value, R square, multiple R, etc. When using the Google TensorFlow API to implement linear regression, how are these parameters mapped? Is there any way we can get the values of these parameters after/during model execution?
From my experience, if you want to have these values while your model runs then you have to hand code them using tensorflow functions. If you want them after the model has run you can use scipy or other implementations. Below are some examples of how you might go about coding R^2, MAPE, RMSE...
total_error = tf.reduce_sum(tf.square(tf.sub(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.sub(y, prediction)))
R_squared = tf.sub(tf.div(total_error, unexplained_error),1.0)
R = tf.mul(tf.sign(R_squared),tf.sqrt(tf.abs(unexplained_error)))
MAPE = tf.reduce_mean(tf.abs(tf.div(tf.sub(y, prediction), y)))
RMSE = tf.sqrt(tf.reduce_mean(tf.square(tf.sub(y, prediction))))
I believe the formula for R2 should be the following. Note that it would go negative when the network is so bad that it does a worse job than the mere average as a predictor:
total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, pred)))
R_squared = tf.subtract(1.0, tf.divide(unexplained_error, total_error))
Adjusted_R_squared = 1 - (1 - R_squared) * (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of features.
You should not use a hand-written formula for R squared. This already exists in TensorFlow Addons. You will only need to extend it to adjusted R squared.
I would strongly recommend against using a recipe to calculate r-squared itself! The examples I've found do not produce consistent results, especially with just one target variable. This gave me enormous headaches!
The correct thing to do is to use tensorflow_addons.metrics.RSquare(). TensorFlow Addons is on PyPI here and the documentation is part of TensorFlow here. All you have to do is set y_shape to the shape of your output; often it is (1,) for a single output variable.
Then you can use what RSquare() returns in your own metric that handles the adjustment.
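A minimal sketch of that approach, assuming a tensorflow_addons version that still accepts the y_shape argument; the data, n and k are made up for illustration:

import tensorflow as tf
import tensorflow_addons as tfa

# hypothetical toy data: n observations, model trained on k features
n, k = 5, 2
y_true = tf.constant([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_pred = tf.constant([[1.1], [1.9], [3.2], [3.8], [5.1]])

# y_shape=(1,) because there is a single output variable
metric = tfa.metrics.RSquare(y_shape=(1,))
metric.update_state(y_true, y_pred)
r2 = float(metric.result())

# adjusted R^2 via the formula quoted above
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, adjusted_r2)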

Constrained np.polyfit

I am trying to fit a quadratic to some experimental data using polyfit in numpy. I am looking to get a concave curve, and hence want to make sure that the coefficient of the quadratic term is negative. The fit itself is also weighted, as in there are weights on the points. Is there an easy way to do that? Thanks.
The use of weights is described here (numpy.polyfit).
Basically, you need a weight vector with the same length as x and y.
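For illustration, a minimal sketch of a weighted quadratic fit with np.polyfit; the data and weights here are made up:

import numpy as np

# hypothetical data: w must have the same length as x and y
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.5, 5.0, 4.8, 3.0])
w = np.array([1.0, 1.0, 2.0, 2.0, 1.0])   # heavier weight on the middle points

coeffs = np.polyfit(x, y, deg=2, w=w)      # coeffs[0] is the x**2 coefficient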
To avoid the wrong sign in the coefficient, you could use a fit function definition like
def fitfunc(x, a, b, c):
    return -1 * abs(a) * x**2 + b * x + c
This will give you a negative coefficient for x**2 at all times.
You can use curve_fit.
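For instance, a sketch of combining the fitfunc idea above with scipy.optimize.curve_fit and per-point weights; the data here is made up:

import numpy as np
from scipy.optimize import curve_fit

def fitfunc(x, a, b, c):
    # the quadratic coefficient is forced negative, so the fit is always concave
    return -1 * abs(a) * x**2 + b * x + c

# hypothetical noisy data and per-point weights
x = np.linspace(0.0, 10.0, 20)
y = -0.5 * x**2 + 3.0 * x + 1.0 + np.random.normal(scale=0.5, size=x.size)
weights = np.ones_like(x)                  # larger weight = more trusted point

# curve_fit expects uncertainties, so pass the inverse of the weights as sigma
popt, pcov = curve_fit(fitfunc, x, y, sigma=1.0 / weights)
a, b, c = popt
print("quadratic coefficient:", -abs(a))   # guaranteed to be negative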
Or you can run polyfit with degree 2 and, if the quadratic coefficient (the first one polyfit returns) is greater than 0, run a linear polyfit instead (polyfit with degree 1).

How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
import numpy as np
from sklearn import linear_model, preprocessing

columns = ['R', 'T', 'V', 'X', 'Z']
for c in columns:
    results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print(labels)
#print(features)
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features) - labels) ** 2))
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool, as the relationship is non-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit-learn? (I can see it mentioned on the mailing list but not in the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solution is not that easy and can be heavily influenced by your data.
If your variables R and Z are bounded (for example 0 < R < 1 and -3 < Z < 2) then you should be able to get a good estimate of the output variable using a neural network.
Using a neural network you should be able to estimate your output even without preprocessing the data and using all the variables as input.
(Of course here you will have to solve a minimization problem.)
sklearn does not implement neural networks, so you should use pybrain or fann.
If you want to preprocess the data in order to make the minimization problem easier, you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non-linear feature selection. I would try to estimate the important variables from your dataset, trying in this order:
1. lasso (a sketch follows below)
2. sparse PCA
3. decision trees (you can actually use them for feature selection), but I would avoid this as much as possible
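A minimal lasso sketch for the feature-selection idea above; the data is made up to mimic the ZR = Z / R setup, and alpha is an arbitrary choice you would tune:

import numpy as np
from sklearn.linear_model import Lasso

# hypothetical data: 5 features, target built as column 4 / column 0
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(200, 5))
y = X[:, 4] / X[:, 0]

# the L1 penalty drives some coefficients to exactly zero
model = Lasso(alpha=0.1)
model.fit(X, y)
print(model.coef_)   # zero entries mark features the model dropped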
If this is a toy problem I would suggest moving towards something more standard.
You can find a lot of examples on Google.

Statistical procedure decision

I have two problems at hand:
I have a dependent variable, let's say GDP, and many other independent variables. I need to know what procedure I can use to find which among the IVs are leading or lagging indicators. I have developed the model in SAS and Excel.
Based on some buy/sell rules built around an x-day EMA and y-day SMA cross, I need to compute returns. I need to know which procedure I should use to find what values of x and y will give me the best returns (x and y being drawn from an array of predefined value pairs like (200, 50), (300, 30), etc.). Can a neural network be used here? If so, can anyone give me a link to some documentation on how to carry this out?
Ad 1: probably the easiest approach is to calculate the linear correlation between the time series. Using both simultaneous and shifted time series will tell you something about lead/lag.
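A minimal sketch of that lead/lag check using shifted series in pandas; the two series here are random stand-ins for GDP and one candidate indicator:

import numpy as np
import pandas as pd

# hypothetical series, aligned by period (e.g. quarterly observations)
gdp = pd.Series(np.random.randn(100)).cumsum()
indicator = pd.Series(np.random.randn(100)).cumsum()

# correlation of the indicator shifted by various lags against GDP;
# a peak at a positive lag suggests the indicator leads GDP
for lag in range(-4, 5):
    print(lag, gdp.corr(indicator.shift(lag)))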
Ad 2: look into optimization, not neural networks. The initial and easiest approach is a grid search: calculate the returns for each combination of x and y. Pseudocode:
x_values = range(50, 501, 50)    # candidate x values: 50, 100, ..., 500
y_values = range(10, 101, 10)    # candidate y values: 10, 20, ..., 100
returns = {}
for i in x_values:
    for j in y_values:
        returns[(i, j)] = calculate_returns(i, j)
best_x, best_y = max(returns, key=returns.get)   # pair with the highest return