What metric to use to define model performance when the change in dependent variable is very small? - pandas

I have built a regression model with 5 inputs and 1 output.
I am using r2_score as a metric to evaluate my model performance.
# calculate R^2 on the test set (r2_score expects y_true first, then y_pred)
from sklearn.metrics import r2_score
score_test = r2_score(y_test, y_pred)
The variation in my output variable is very small. My output variable looks like this:
102.23003
102.23007
102.22958
102.22858
102.22691
102.2246
102.22179
102.21818
102.21372
102.20828
102.20172
102.193886
102.18463
102.1738
102.160164
102.14266
[Plot: distribution of my dependent variable]
The variation is only at the second decimal place.
When I use r2_score as the evaluation metric, it comes out to around 99%.
So my question is: is r2_score a correct metric in cases where the variation in the dependent variable is so small?
Does this 99% r2_score imply that my model is performing very well?

In the comments you ask about the algorithm and performance metrics. Here is what I did: I pasted your data into my online open-source statistical distribution fitter at http://zunzun.com/StatisticalDistributions/1/ and hit the Submit button. It fit the data to the 90+ continuous statistical distributions in scipy.stats, and the generalized Pareto distribution was near the top of the results, yielding:
Generalized Pareto distribution
http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.genpareto.html
Fit Statistics for 16 data points:
Negative Two Log Likelihood = -1.3852573661570938E+02
AIC = -1.3252573661570938E+02
AICc (Burnham and Anderson) = -1.3052573661570938E+02
Parameters:
c = -3.7800889226684840E+00
location = 1.0213689198388039E+02
scale = 3.5222118656995849E-01
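For reference, the same fit can be reproduced locally with scipy; this is a sketch, assuming y is the array of output values listed in the question:
from scipy.stats import genpareto

# maximum-likelihood fit; returns the shape (c), location and scale parameters
c, loc, scale = genpareto.fit(y)
print(c, loc, scale)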

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross-validation analysis on some data, is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning, so GridSearchCV will never see those 20 data points. The other 80 data rows are passed to the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. For this case the cross-validation parameter cv is assigned a value of 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross validation data for each fold? Maybe seeing which indices were used for the split?
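One way to make the folds inspectable (a sketch, reusing model and param_grid from above) is to pass an explicit splitter object instead of cv=3; GridSearchCV will then reuse exactly those splits:
from sklearn.model_selection import StratifiedKFold, GridSearchCV

cv = StratifiedKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print(f"fold {fold}: train indices {train_idx}, validation indices {val_idx}")

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)
grid_result = grid.fit(X, Y)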

Use of DeepExplainer to get shap values for an MLP model in Keras with tensorflow backend

I am playing around with DeepExplainer to get SHAP values for deep learning models. By following some tutorials I can get some results, i.e. which variables are pushing the model prediction away from the base value, which is the average model output over the training set.
I have around 5,000 observations and 70 features. The performance of DeepExplainer is quite satisfactory. My code is:
model0 = load_model(model_p+'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947,:], test2.iloc[947,:])
The resulting force plot is the following:
Here the base value is 0.012 (it can also be seen through e.expected_value[0]) and is very close to the output value, which is 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction obtained through model0.predict(test[healthFeatures])[947] = -0.103. How should I interpret the output value?
2) As can be seen, I am using the whole training set as the background to approximate the conditional expectations of the SHAP values. What is the difference between using random samples from the training set and using the entire set? Is it only a performance issue?
Many thanks in advance!
Probably too late, but this is still a very common question that will benefit other beginners. To answer (1): the expected value and the output value will be different. The expected value is, as the name suggests, the average over the scores predicted by your model; e.g., if the model outputs probabilities, then it is the average of those probabilities. For (2): as long as the background set has fewer than about 5k rows it won't change much, but if it is larger than 5k your calculations can take days to finish.
See this (lines 21-25) for more comprehensive answers.
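As a rough sketch of the subsampling idea from point (2), reusing the names from the question, one might pass only a random subset (here 500 rows, an arbitrary choice) as the background:
import numpy as np
import shap

# DeepExplainer only needs a representative background, not every observation
rng = np.random.default_rng(0)
idx = rng.choice(background.shape[0], size=500, replace=False)
e = shap.DeepExplainer(model0, background[idx])
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))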

How to predict missing values in Python using linear regression (3 years' worth of data)

I have three years' worth of data, 2012 to 2014; however, 2014 has missing values (100 rows). I'm not really sure how to deal with this. Here is my attempt:
X = red2012Mob.values
y = red2014Mob.values
X = X.reshape(-1,1)
y = y.reshape(-1,1)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I'm not changing any of the 2014 data where values are missing; I just feed it directly into the model.
There are two ways:
1. Drop the instances with missing data (e.g. using red2012Mob.dropna(), or, if it is a time series, leave out complete blocks of missing data, e.g. start later in 2014).
2. Impute the missing data. Here, however, you won't get a one-size-fits-all answer, as it really depends on your data and your problem. Since you seem to have time-series data, the simplest strategy for "small" holes is to use linear or constant interpolation. If time dependency is not so important, the mean of the column may be a good strategy. For larger holes you may find a suitable model to fill the data. Sometimes a "naive" strategy like using the same value from one seasonality before (e.g. last Monday's data for the current Monday) may work, or you can use a KNN imputer (either check out this sklearn PR or the package discussed here). For the simple strategies, there is also a module in the upcoming sklearn release.
In practice I usually combine methods. For instance, up to some point I will try the strategies from point 2, but if the data is too bad it is usually better to have less "good" data than a lot of imputed data.
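As a minimal sketch of the simple strategies above (assuming red2014Mob is a pandas Series or DataFrame with a time-ordered index):
# fill small gaps by linear interpolation between neighbouring values...
red2014_filled = red2014Mob.interpolate(method='linear')
# ...or simply drop the rows that contain missing values
red2014_clean = red2014Mob.dropna()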
I don't know if you have data for 2013 available with you. If it is available, my first recommendation would be to use that as well. As far as data for training goes, you should only take the data for 2014 with non-missing values and then fit your model using these values. Once you get a decent cross-validation accuracy on the model, you can take the subset of data with missing values for 2014 and use that to predict values for 2014.
For better understanding, here is a small piece of sample code to subset the non-NaN values of a list/column:
import numpy as np
# keep only the entries of a that are not NaN
a1 = [v for v in a if not np.isnan(v)]
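For example, a rough sketch of that workflow (assuming X holds the 2012 values and y the 2014 values reshaped as in the question, with NaN marking the missing rows):
import numpy as np
from sklearn.linear_model import LinearRegression

mask = ~np.isnan(y).ravel()          # rows where the 2014 value is known

model = LinearRegression()
model.fit(X[mask], y[mask])          # train only on the complete rows
y[~mask] = model.predict(X[~mask])   # fill in the missing 2014 values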

Tensorflow Linear Regression: Getting values for Adjusted R Square, Coefficients, P-value

There are a few key statistics associated with linear regression, e.g. adjusted R squared, coefficients, p-values, R squared, multiple R, etc. When using Google's TensorFlow API to implement linear regression, how are these parameters mapped? Is there any way to get the values of these parameters during or after model execution?
From my experience, if you want to have these values while your model runs then you have to hand code them using tensorflow functions. If you want them after the model has run you can use scipy or other implementations. Below are some examples of how you might go about coding R^2, MAPE, RMSE...
total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, prediction)))
R_squared = tf.subtract(tf.divide(total_error, unexplained_error), 1.0)
R = tf.multiply(tf.sign(R_squared), tf.sqrt(tf.abs(unexplained_error)))
# mean absolute percentage error
MAPE = tf.reduce_mean(tf.abs(tf.divide(tf.subtract(y, prediction), y)))
# root mean squared error
RMSE = tf.sqrt(tf.reduce_mean(tf.square(tf.subtract(y, prediction))))
I believe the formula for R2 should be the following. Note that it would go negative when the network is so bad that it does a worse job than the mere average as a predictor:
total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, pred)))
R_squared = tf.subtract(1.0, tf.divide(unexplained_error, total_error))
Adjusted_R_squared = 1 - (1 - R_squared) * (n - 1) / (n - k - 1)
where n is the number of observations and k is the number of features.
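To see the point about negative values, here is a small self-contained check with made-up numbers (eager TensorFlow), where a constant predictor does worse than simply predicting the mean:
import tensorflow as tf

y = tf.constant([1.0, 2.0, 3.0, 4.0])
pred = tf.constant([3.0, 3.0, 3.0, 3.0])   # worse than predicting the mean (2.5)

total_error = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
unexplained_error = tf.reduce_sum(tf.square(tf.subtract(y, pred)))
R_squared = tf.subtract(1.0, tf.divide(unexplained_error, total_error))
print(R_squared.numpy())   # approximately -0.2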
You should not need a hand-rolled formula for R squared. It already exists in TensorFlow Addons; you only need to extend it to adjusted R squared.
I would strongly recommend against using a recipe to calculate R squared yourself! The examples I've found do not produce consistent results, especially with just one target variable. This gave me enormous headaches!
The correct thing to do is to use tensorflow_addons.metrics.RSquare(). TensorFlow Addons is on PyPI here and the documentation is part of TensorFlow here. All you have to do is set y_shape to the shape of your output, often (1,) for a single output variable.
Then you can use what RSquare() returns in your own metric that handles the adjustment.
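A minimal sketch of that approach, assuming a single output variable and k = 5 features (both placeholder values), with y_true and y_pred being your targets and predictions:
import tensorflow_addons as tfa

r2_metric = tfa.metrics.RSquare(y_shape=(1,))
r2_metric.update_state(y_true, y_pred)
r2 = r2_metric.result().numpy()

n, k = len(y_true), 5                                   # observations and features (assumed)
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)      # the manual adjustment step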

How to get scikit learn to find simple non-linear relationship

I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made a column ZR, which is column Z divided by column R. As a first step using scikit-learn I wanted to see if I could predict ZR from the other columns (which should be possible, as I just made it from R and Z). My steps have been:
import numpy as np
from sklearn import preprocessing, linear_model

columns = ['R', 'T', 'V', 'X', 'Z']
for c in columns:
    results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print(labels)
#print(features)
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print(np.mean((regr.predict(features) - labels)**2))
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong, as I think it destroys the Z/R relationship. What is the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool, as the relationship is non-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit-learn? (I can see it mentioned in the mailing list but not in the docs.) My aim is to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solution is not that simple and can depend heavily on your data.
If your variables R and Z are bounded (for example 0 < R < 1 and -3 < Z < 2) then you should be able to get a good estimate of the output variable using a neural network.
Using a neural network you should be able to estimate the output even without preprocessing the data and while using all the variables as input.
(Of course, here you will have to solve a minimization problem.)
scikit-learn does not implement neural networks, so you should use pybrain or FANN.
If you want to preprocess the data in order to make the minimization problem easier, you can try to extract the right features from the predictor matrix.
I do not think there are many tools for non-linear feature selection. I would try to estimate the important variables from your dataset using, in this order (a minimal lasso sketch follows this list):
1. lasso
2. sparse PCA
3. decision trees (you can actually use them for feature selection), but I would avoid this as much as possible
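Here is the lasso sketch mentioned above, reusing the features and labels arrays from the question; the alpha value is just a placeholder to tune for your data:
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)    # placeholder regularization strength
lasso.fit(features, labels)
print(lasso.coef_)          # the L1 penalty pushes unimportant coefficients to zero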
If this is a toy problem, I would suggest you move towards something more standard. You can find a lot of examples on Google.