I've been working on a DNNClassifier model in Tensorflow, and I have gotten the model to train, evaluate and output results.
Here is the code I'm using to output my predictions:
predictions = classifier.predict(input_fn=predict_input_fn)
i = 0
for j, p in enumerate(predictions):
print("Prediction %s: %s" % (j + 1, p["probabilities"]))
i = i + 1
if i > 100:
break
i is being used to limit the results for testing purposes, because there's roughly 16,000 results being output and I don't need to see all of them for the purposes of what I'm trying to do right now.
The output I'm getting looks like this:
Prediction 1: [ 5.11678644e-02 9.48832154e-01 3.84762299e-37]
Prediction 2: [ 0.0352843 0.96471566 0. ]
Prediction 3: [ 1.04001068e-04 9.99895930e-01 0.00000000e+00]
Prediction 4: [ 0.0323724 0.96762753 0. ]
I know that these are probabilities of some kind, but I can't find documentation on what each one means. There's three per row, but only two categories, so I am guessing that one of them is a measure of certainty?
I realize that this is not strictly a programming question, rather it is more about output and documentation. However, before asking, I did look on other StackExchange sites and the TensorFlow documentation itself to try and find a better place to ask/an answer. The AI Stack Exchange website is still in beta and appears to have very little TensorFlow related activity (which is understandable, since many TensorFlow questions are programming questions), and I have had reasonable success on StackOverflow before when it comes to TF questions.
It looks like you are classifying between three categories (labels). So what you are seeing in the predictions is your networks weighted guess for each possible category (label). For instance, in your first prediction the network results can be interpreted as: there is ~5% chance of the data belonging to the first category (label), a ~95% chance of the data belonging to the second category (label), and ~0% chance of the data belonging to the third category (label).
Related
Thank you for reading. I'm not good at English.
I am wondering how to predict and get future time series data after model training. I would like to get the values after N steps.
I wonder if the time series data has been properly learned and predicted.
How i do this right get the following (next) value?
I want to get the next value using like model.predict or etc
I have x_test and x_test[-1] == t, so the meaning of the next value is t+1, t+2, .... t+n,
In this example I want to get predictions of the next t+1, t+2 ... t+n
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)
X_test = []
for i in range(look_back, inputs.shape[0]):
X_test.append(inputs[i - look_back:i])
X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is like below
The results from X_test[-20:] and the following 20 predictions looks like same.
I'm wondering if it's the correct train and predicted value.
I'm wondering if it was a right training and predict.
full source
The method I tried first did not work correctly.
Seconds
I realized something is wrong, I tried using another official data
So, I used the time series in the Tensorflow tutorial to practice predicting the model.
a = y_val[-look_back:]
for i in range(N-step prediction): # predict a new value n times.
tmp = model.predict(a.reshape(-1, look_back, num_feature)) # predicted value
a = a[1:] # remove first
a = np.append(a, tmp) # insert predicted value
The results were predicted in a linear regression shape very differently from the real data.
Output a linear regression that is independent of the real data:
full source (After the 25th line is my code.)
I'm really very curious what is a standard method of predicting next values of a stock market.
Thank you for reading the long question. I seek advice about your priceless opinion.
Q : "How can I find a standard method of predicting next values of a stock market...?"
First - salutes to C64 practitioner!
Next, let me say, there is no standard method - there cannot be ( one ).
Principally - let me draw from your field of a shared experience - one can easily predict the near future flow of laminar fluids ( a technically "working" market instrument - is a model A, for which one can derive a better or worse predictive tool )
That will never work, however, for turbulent states of the fluids ( just read the complexity of the attempts to formulate the many-dimensional high-order PDE for a turbulence ( and it still just approximates the turbulence ) ) -- and this is the fundamentally "working" market ( after some expected fundamental factor was released ( read NFP or CPI ) or some flash-news was announced in the news - ( read a Swiss release of currency-bonding of CHF to some USD parity or Cyprus one time state tax on all speculative deposits ... the financial Big Bangs follow ... )
So, please, do not expect one, the less any simple, model for reasonably precise predictions, working for both the laminar and turbulent fluidics - the real world is for sure way more complex than this :o)
I am playing around with DeepExplainer to get shap values for deep learning models. By following some tutorials I can get some results, i.e. what variables are pushing the model prediction from the base value, which is the average model output in training set.
I have around 5,000 observations along with 70 features. The performance of DeepExplainer is quite satisfactory. And my code is:
model0 = load_model(model_p+'health0.h5')
background = healthScaler.transform(train[healthFeatures])
e = shap.DeepExplainer(model0, background)
shap_values = e.shap_values(healthScaler.transform(test[healthFeatures]))
test2 = test[healthFeatures].copy()
test2[healthFeatures] = healthScaler.transform(test[healthFeatures])
shap.force_plot(e.expected_value[0], shap_values[0][947,:], test2.iloc[947,:])
And the plot is the following:
Here the base value is 0.012 (can also be seen through e.expected_value[0]) and very close to the output value which is 0.01.
At this point I have some questions:
1) The output value is not identical to the prediction gotten through model0.predict(test[healthFeatures])[947] = -0.103 How should I assess output value?
2) As can be seen, I am using whole training set as the background to approximate conditional expectations of SHAP values. What is the difference between using random samples from training set and entire set? Is it only related to performance issue?
Many thanks in advance!
Probably too late but stil a most common question that will benefit other begginers. To answer (1), the expected and out values will be different. the expected is, as the name suggest, is the avereage over the scores predicted by your model, e.g., if it was probability then it is the average of the probabilties that your model spits. For (2), as long as the backroung values are less then 5k, it wont change much, but if > 5k then your calculations will take days to finish.
See this (lines 21-25) for more comprehensive answers.
My task here is to find a way to get a suggested value of the most important feature or features. By changing into the suggested values of the features, I want the classification result to change as well.
Snapshot of dataset
The following is the procedures that I have tried so far:
Import dataset (shape: 1162 by 22)
Build a simple neural network (2 hidden layers)
Since the dependent variable is simply either 0 or 1 (classification problem), I onehot-encoded the variable. So it's either [0, 1] or [1,0]
After splitting into train & test data, I train my NN model and got accuracy of 77.8%
To know which feature (out of 21) is the most important one in the determination of either 0 or 1, I trained the data using Random Forest classifier (scikit-learn) and also got 77.8% accuracy and then used the 'feature_importances_' offered by the random forest classifier.
As a result, I found out that a feature named 'a_L4' ranks the highest in terms of relative feature importance.
The feature 'a_L4' is allowed to have a value from 0 to 360 since it means an angle. In the original dataset, 'a_L4' comprises of only 12 values that are [5, 50, 95, 120, 140, 160, 185, 230, 235, 275, 320, 345].
I augmented the original dataset by directly adding all the possible 12 values for each cases giving a new dataset of shape (1162x12 by 22).
I imported the augmented dataset and tested it on the previously trained NN model. The result was a FAILURE. There hardly was any change in the classification meaning almost no '1's switched to '0's.
My conclusion was that changing the values of 'a_L4' was not enough to bring a change in the classification. So I additionally did the same procedure again for the 2nd most important feature which in this case was 'b_L7_p1'.
So writing all the possible values that the two most important features can have, now the new dataset becomes the shape of (1162x12x6 by 22). 'b_L7_p1' is allowed to have 6 different values only, thus the multiplication by 6.
Again the result was a FAILURE.
So, my question is what might have I done wrong in the procedure described above? Do I need to keep searching for more important features and augment the data with all the possible values they can have? But since this is a tedious task with multiple procedures to be done manually and leads to a dataset with a huge size, I wish there was a way to construct an inference-based NN model that can directly give out the suggested values of a certain feature or features.
I am relatively new to this field of research, so could anyone please tell me some key words that I should search for? I cannot find any work or papers regarding this issue on Google.
Thanks in advance.
In this case I would approach the problem in the following way:
Normalize the whole dataset. As you can see from the dataset your features have different scales. It is utterly important that you make all features to have the same scale. Have a look at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
The second this that I would do now is train and evaluate a model (It can be whatever you want) to get a so called baseline model.
Then, I would try PCA to see whether all features are needed. Maybe you are including unnecessary sparsity to the model. See: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
For example if you set the n_components in PCA to be 0.99 then you are reducing the number of features while retaining as 0.99 explained variance.
Then I would train the model to see whether there is any improvement. Please note that only by adding the normalization itself there should be an improvement.
If I want to see by the dataset itself which features are important I would do: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html This would select a specified number of features based on some statistical test lets say: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
Train a model and evaluate it again to see whether there is some improvement.
Also, you should be aware that the NNs can perform feature engineering by themselves, so computing feature importance is redundant in a way.
Let me know whether you will see any improvements.
I am trying to implement active learning machine(an experiment for a project) algorithm, where I want to train separately, please check my code below.
clf = BernoulliNB()
clf.fit(X_train[0:40], y_train[0:40])
clf.fit(X_train[40:], y_train[40:])
The above usually done like this
clf = BernoulliNB()
clf.fit(X_train, y_train)
Both have different accuracy score. I want to add training data to existing model itself since its computationally expensive - I don't want my model to do one more time computation.
Any way I can ?
You should use partial_fit to train your model in batches.
clf = BernoulliNB()
clf.partial_fit(X_train[0:40], y_train[0:40])
clf.partial_fit(X_train[40:], y_train[40:])
Please check this to know more about the function.
Hope this helps:)
This is called online training or Incremental learning used for large data. Please see this page for strategies.
Essentially, in scikit-learn, you need partial_fit() with all the labels in y known in advance.
partial_fit(X, y, classes=None, sample_weight=None)
classes : array-like, shape = [n_classes] (default=None)
List of all the classes that can possibly appear in the y vector. Must be provided at the first call to partial_fit, can be omitted in subsequent calls.
If you simply do this:
clf.partial_fit(X_train[0:40], y_train[0:40])
clf.partial_fit(X_train[40:], y_train[40:])
Then there is a possibility that that if any class which is not present in the first 40 samples, and comes in next iterations of partial_fit(), then it will throw an error.
So ideally you should be doing this:
# First call
clf.partial_fit(X_train[0:40], y_train[0:40], classes = np.unique(y_train))
# subsequent calls
clf.partial_fit(X_train[40:80], y_train[40:80])
clf.partial_fit(X_train[80:], y_train[80:])
and so on..
I have some data in a pandas dataframe (although pandas is not the point of this question). As an experiment I made column ZR as column Z divided by column R. As a first step using scikit learn I wanted to see if I could predict ZR from the other columns (which should be possible as I just made it from R and Z). My steps have been.
columns=['R','T', 'V', 'X', 'Z']
for c in columns:
results[c] = preprocessing.scale(results[c])
results['ZR'] = preprocessing.scale(results['ZR'])
labels = results["ZR"].values
features = results[columns].values
#print labels
#print features
regr = linear_model.LinearRegression()
regr.fit(features, labels)
print(regr.coef_)
print np.mean((regr.predict(features)-labels)**2)
This gives
[ 0.36472515 -0.79579885 -0.16316067 0.67995378 0.59256197]
0.458552051342
The preprocessing seems wrong as it destroys the Z/R relationship I think. What's the right way to preprocess in this situation?
Is there some way to get near 100% accuracy? Linear regression is the wrong tool as the relationship is not-linear.
The five features are highly correlated in my data. Is non-negative least squares implemented in scikit learn ? ( I can see it mentioned in the mailing list but not the docs.) My aim would be to get as many coefficients set to zero as possible.
You should easily be able to get a decent fit using random forest regression, without any preprocessing, since it is a nonlinear method:
model = RandomForestRegressor(n_estimators=10, max_features=2)
model.fit(features, labels)
You can play with the parameters to get better performance.
The solutions is not as easy and can be very influenced by your data.
If your variables R and Z are bounded (for ex 0<R<1 -3<Z<2) then you should be able to get a good estimation of the output variable using neural network.
Using neural network you should be able to estimate your output even without preprocessing the data and using all the variables as input.
(Of course here you will have to solve a minimization problem).
Sklearn do not implement neural network so you should use pybrain or fann.
If you want to preprocess the data in order to make the minimization problem easier you can try to extract the right features from the predictor matrix.
I do not think there are a lot of tools for non linear features selection. I would try to estimate the important variables from you dataset using in this order :
1-lasso
2- sparse PCA
3- decision tree (you can actually use them for features selection ) but I would avoid this as much as possible
If this is a toy problem I would sugges you to move towards something of more standard.
You can find a lot of examples on google.