'key of type tuple not found and not a MultiIndex' while generating ROC for multi-class classification - multi-index

I am trying to generate ROC curves for a multi-class classification with XGBoost, but I get 'key of type tuple not found and not a MultiIndex' every time.
Classification:
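import numpy as np
import matplotlib.pyplot as plt
from xgboost import XGBClassifier
from xgboost import plot_tree
from sklearn import metrics
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from itertools import cycle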
model = XGBClassifier()
model = model.fit(x_train, y_train)
print('Accuracy:', model.score(x_test,y_test))
score=cross_val_score(model,X,y,cv=5)
print(score)
print('CV Score:',np.mean(score))
y_pred1=model.predict(x_test)
Generating ROC:
n_classes = 5
fpr = dict()
tpr = dict()
roc_auc = dict()
lw=2
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred1[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green', 'yellow', 'pink'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()
Out:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-34-14f08a1b6222> in <module>
5 lw=2
6 for i in range(n_classes):
----> 7 fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_pred1[:, i])
8 roc_auc[i] = auc(fpr[i], tpr[i])
9 colors = cycle(['blue', 'red', 'green', 'yellow', 'pink'])
2 frames
/usr/local/lib/python3.8/dist-packages/pandas/core/series.py in _get_values_tuple(self, key)
1014
1015 if not isinstance(self.index, MultiIndex):
-> 1016 raise KeyError("key of type tuple not found and not a MultiIndex")
1017
1018 # If key is contained, would have returned by now
KeyError: 'key of type tuple not found and not a MultiIndex'
Q: Why is it returning a multi-index error even though I have 5 classes in my dataframe?
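A likely explanation: y_test is a pandas Series, which is one-dimensional, so y_test[:, i] hands pandas the tuple (slice(None), i) as a key, and pandas only accepts tuple keys on a MultiIndex. The number of classes does not matter here. model.predict also returns a 1-D array of class labels, so y_pred1[:, i] would fail next. One-vs-rest ROC curves need one indicator column per class and per-class scores from predict_proba. Below is a minimal sketch of that idea, assuming synthetic data and using LogisticRegression as a stand-in for XGBClassifier so it runs without xgboost; both are assumptions, not the asker's setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

n_classes = 5

# Synthetic 5-class data (assumption: stands in for the asker's X and y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_classes=n_classes, random_state=0)
y = pd.Series(y)  # a 1-D Series, like the asker's y_test
x_train, x_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y)

# Stand-in classifier; XGBClassifier also offers fit and predict_proba.
model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

# One indicator column per class: shape (n_samples, n_classes).
y_test_bin = label_binarize(y_test, classes=list(range(n_classes)))
# Class probabilities rather than hard labels: shape (n_samples, n_classes).
y_score = model.predict_proba(x_test)

fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

print(roc_auc)
```

The plotting loop from the question can then be reused unchanged, since fpr, tpr and roc_auc have the same keys.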

Related

I'm having problems with one-hot encoding

I am using logistic regression on a football dataset, but when I one-hot encode the home and away team names the model gets 100% accuracy, even after a train_test_split. What am I doing wrong?
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
x = df[['Home', 'Away', 'Res', 'HG', 'AG', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values
np.set_printoptions(threshold=np.inf)
model = LogisticRegression()
ohe = OneHotEncoder(categories=[df.Home, df.Away, df.Res], sparse=False)
x = ohe.fit_transform(x)
print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:",
accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
Overfitting would be a situation where your training accuracy is very high, and your test accuracy is very low. That means it's "over fitting" because it essentially just learns what the outcome will be on the training, but doesn't fit well on new, unseen data.
The reason you are getting 100% accuracy is precisely as I stated in the comments: there is (for lack of a better term) data leakage. You are essentially allowing your model to "cheat". Your target variable y ('BTTS') is engineered from the data: it is derived from 'HG' and 'AG', so those two columns determine it completely. You define 'BTTS' as 1 when both 'HG' and 'AG' are greater than 0, and then you include those 2 columns in your training data. So the model simply picked up that obvious association (i.e., when the home goals are 1 or more and the away goals are 1 or more, both teams scored).
Once the model sees both values greater than 0, it predicts 1; if either value is 0, it predicts 0.
Drop 'HG' and 'AG' from the x (features).
Once we remove those 2 columns, you'll see a more realistic performance (albeit poor - slightly better than a flip of the coin) here:
1.0
0.5625
accuracy: 0.5625
precision: 0.6666666666666666
recall: 0.4444444444444444
f1 score: 0.5333333333333333
With the Confusion Matrix:
from sklearn.metrics import confusion_matrix
labels = np.unique(y).tolist()
cf_matrixGNB = confusion_matrix(y_test, y_pred, labels=labels)
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf_matrixGNB, annot=True,
cmap='Blues')
ax.set_title('Confusion Matrix\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
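The leakage described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the goals and odds are randomly generated (not FIN.csv), and a DecisionTreeClassifier replaces the logistic regression because a tree recovers the "both greater than 0" rule exactly. The helper holdout_accuracy is a hypothetical name.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic goals and odds-like noise (assumption: not the FIN.csv data).
hg = rng.poisson(1.5, n)
ag = rng.poisson(1.2, n)
odds = rng.uniform(1.5, 4.0, size=(n, 3))
btts = ((hg > 0) & (ag > 0)).astype(int)  # target derived from hg and ag


def holdout_accuracy(features):
    """Fit a tree on a train split and return accuracy on the held-out split."""
    x_tr, x_te, y_tr, y_te = train_test_split(features, btts, random_state=0)
    clf = DecisionTreeClassifier(random_state=0).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)


leaky = holdout_accuracy(np.column_stack([hg, ag, odds]))  # includes HG and AG
clean = holdout_accuracy(odds)                             # HG and AG dropped
print(f"with HG/AG: {leaky:.2f}, without: {clean:.2f}")
```

With the goal columns present the held-out accuracy is close to perfect, and it falls to roughly chance once they are dropped, because the remaining synthetic features carry no signal.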
Another option would be to add a calculated field, 'Total_Goals', and see if the model can predict from that. Again, it gets a little help from the obvious (if 'Total_Goals' is 0 or 1, then 'BTTS' will be 0). But if 'Total_Goals' is 2 or more, it has to rely on the other features to work out whether one of the teams got shut out.
Here's that example:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
df = pd.read_csv("FIN.csv")
df['Date'] = pd.to_datetime(df["Date"])
df = df[(df["Date"] > '2020/04/01')]
df['BTTS'] = np.where((df.HG > 0) & (df.AG > 0), 1, 0)
#print(df.to_string())
df.dropna(inplace=True)
df['Total_Goals'] = df['HG'] + df['AG']
x = df[['Home', 'Away', 'Res', 'Total_Goals', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']].values
y = df['BTTS'].values
np.set_printoptions(threshold=np.inf)
model = LogisticRegression()
ohe = OneHotEncoder(sparse=False)
x = ohe.fit_transform(x)
#print(x)
model.fit(x, y)
print(model.score(x, y))
x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=False)
model.fit(x_train, y_train)
print(model.score(x_test, y_test))
y_pred = model.predict(x_test)
print("accuracy:",
accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("f1 score:", f1_score(y_test, y_pred))
from sklearn.metrics import confusion_matrix
labels = np.unique(y).tolist()
cf_matrixGNB = confusion_matrix(y_test, y_pred, labels=labels)
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.heatmap(cf_matrixGNB, annot=True,
cmap='Blues')
ax.set_title('Confusion Matrix\n');
ax.set_xlabel('\nPredicted Values')
ax.set_ylabel('Actual Values ');
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
Output:
1.0
0.8
accuracy: 0.8
precision: 0.8536585365853658
recall: 0.7777777777777778
f1 score: 0.8139534883720929
To predict on new data, you need the new data in the same form as the training data. You also need to apply any transformations you fit on the training data to the new data:
new_data = pd.DataFrame(
data = [['Haka', 'Mariehamn', 3.05, 3.66, 2.35, 3.05, 3.66, 2.52, 2.88, 3.48, 2.32]],
columns = ['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']
)
to_predict = new_data[['Home', 'Away', 'PH', 'PD', 'PA', 'MaxH', 'MaxD', 'MaxA', 'AvgH', 'AvgD', 'AvgA']]
to_predict_encoded = ohe.transform(to_predict)
prediction = model.predict(to_predict_encoded)
prediction_prob = model.predict_proba(to_predict_encoded)
# Report the probability of the predicted class (columns follow model.classes_, here 0 and 1)
print(f'Predict: {prediction[0]} with {prediction_prob[0][prediction[0]]} probability.')
Output:
Predict: 0 with 0.8204957018099501 probability.
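One way to keep the training and prediction transformations in step is to bundle the encoder and the model in a single pipeline. The sketch below is a suggestion rather than the answerer's code: it assumes a tiny made-up frame ('TeamC' and 'TeamD' are placeholder names), one-hot encodes only the team columns, passes the odds through as numbers, and uses handle_unknown="ignore" so an unseen team does not raise at prediction time.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up training frame (assumption: real rows would come from FIN.csv).
train = pd.DataFrame({
    "Home": ["Haka", "Mariehamn", "TeamC", "Haka", "TeamD", "Mariehamn"],
    "Away": ["TeamD", "Haka", "Mariehamn", "TeamC", "Haka", "TeamD"],
    "AvgH": [2.1, 2.9, 1.8, 2.4, 3.1, 2.2],
    "AvgA": [3.3, 2.4, 4.0, 2.9, 2.2, 3.0],
    "BTTS": [1, 0, 1, 0, 1, 0],
})
features = ["Home", "Away", "AvgH", "AvgA"]

# Encode team names only; numeric odds columns pass through unchanged.
preprocess = ColumnTransformer(
    [("teams", OneHotEncoder(handle_unknown="ignore"), ["Home", "Away"])],
    remainder="passthrough",
)
pipeline = make_pipeline(preprocess, LogisticRegression(max_iter=1000))
pipeline.fit(train[features], train["BTTS"])

# New data must have the same columns as the training features.
new_data = pd.DataFrame([["Haka", "Mariehamn", 2.88, 2.32]], columns=features)
prediction = pipeline.predict(new_data)[0]
class_position = list(pipeline.classes_).index(prediction)
probability = pipeline.predict_proba(new_data)[0][class_position]
print(f"Predict: {prediction} with {probability:.3f} probability.")
```

Calling pipeline.predict applies the fitted encoder automatically, so there is no separate ohe.transform step to forget.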

Wrong ROC curve for multiclass classification

I have trained a CNN to classify images into 5 classes. But when I try to plot ROC curve for each class versus the rest, all 5 classes have almost a diagonal curve with AUC of around 0.5. I have no idea what has gone wrong.
The model should have an accuracy of around 86%.
Here is the code:
import os, shutil
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import models, layers, optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import plot_confusion_matrix, accuracy_score
from sklearn.metrics import roc_curve, auc, roc_auc_score, RocCurveDisplay
from sklearn.preprocessing import label_binarize
import random
model = tf.keras.models.load_model('G:/Myxoid lesion/Myxoid_EN3_finetune4b')
model.summary()
data_dir='G:/Myxoid lesion/Test/'
batch_size = 64
img_height = 300
img_width = 300
test_ds = tf.keras.preprocessing.image_dataset_from_directory(
data_dir,
seed = 123,
image_size=(img_height, img_width),
batch_size=batch_size)
model.compile(optimizer = optimizers.Adam(lr=0.00002),
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics = ['sparse_categorical_accuracy'])
correct = np.array([], dtype='int32')
# Get the labels of test_ds
for x, y in test_ds:
    correct = np.concatenate([correct, y.numpy()])
# Get the prediction probabilities for each class for each test image
prediction_prob = tf.nn.softmax(model.predict(test_ds))
num_class = 5
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(num_class):
    fpr[i], tpr[i], _ = roc_curve(correct, prediction_prob[:,i], pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])
plt.figure()
lw = 2
for i in range(num_class):
    plt.plot(fpr[i], tpr[i],
             color=(random.random(), random.random(), random.random()),
             label='{0} (AUC = {1:0.2f})'.format(labels[i], roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.legend(loc="lower right")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC analysis')
plt.show()
The "prediction_prob" variable contains:
array([[6.3877934e-09, 6.3617526e-06, 5.5736535e-07, 4.9789862e-05,
9.9994326e-01],
[6.5260068e-08, 8.8882577e-03, 3.9350948e-06, 9.9110776e-01,
4.0252076e-11],
[2.7514220e-04, 2.9315910e-05, 1.6688553e-04, 9.9952865e-01,
3.5938730e-10],
...,
[1.1131389e-09, 9.8325908e-01, 3.4283744e-06, 1.6737511e-02,
7.3243338e-12],
[1.4697845e-08, 4.7125661e-05, 1.4077022e-03, 6.4052530e-02,
9.3449265e-01],
[9.9999940e-01, 1.3071107e-07, 4.3149896e-07, 4.7902233e-08,
9.2861301e-09]], dtype=float32)>
While the "correct" variable contains the correct label for each test image:
array([0, 1, 4, ..., 4, 2, 4])
I think I follow what is mentioned on the scikit-learn website.
The generated tpr[i] and fpr[i] values are almost linearly related, so the AUC comes out at about 0.5.
I think there is a problem in generating tpr[i] and fpr[i]? Could anyone figure out the problem?
Thanks!
If I generate the labels and prediction in this way, then I can get the correct ROC curve:
prediction_prob = np.array([]).reshape(0,5)
correct = np.array([], dtype='int32')
for x, y in test_ds:
    correct = np.concatenate([correct, y.numpy()])
    prediction_prob = np.vstack([prediction_prob, tf.nn.softmax(model.predict(x))])
However, if I get the predictions from model.predict(test_ds), the order of the predictions somehow differs from the order of the labels I collected, so they no longer match. I am not sure whether this is a bug in tensorflow or whether there is another explanation.
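A probable explanation is that image_dataset_from_directory shuffles by default (shuffle=True) and the dataset is reshuffled on each pass, so the loop that collects labels and the later model.predict(test_ds) call see the images in different orders. Passing shuffle=False when building the test dataset, or collecting labels and predictions in the same loop as above, should keep them aligned. The effect of misalignment on the AUC can be shown without TensorFlow; the sketch below uses synthetic labels and scores, which are assumptions and not the CNN output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, num_class = 2000, 5

# Synthetic labels and fairly accurate per-class scores (assumption).
correct = rng.integers(0, num_class, size=n)
scores = np.eye(num_class)[correct] + rng.normal(0.0, 0.3, size=(n, num_class))

# Labels and scores in the same order: class 0 versus the rest.
auc_aligned = roc_auc_score(correct == 0, scores[:, 0])

# Labels collected in a different (shuffled) order than the predictions.
order = rng.permutation(n)
auc_misaligned = roc_auc_score(correct[order] == 0, scores[:, 0])

print(f"aligned: {auc_aligned:.2f}, misaligned: {auc_misaligned:.2f}")
```

The same scores that give a high AUC when aligned give a value near 0.5 once the labels are permuted, which matches the diagonal curves in the question.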
Also I cannot get the micro-averaging (though this is not that important for my goal)
fpr["micro"], tpr["micro"], _ = roc_curve(correct.ravel(), prediction_prob.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
It gives the following error:
raise ValueError("{0} format is not supported".format(y_type))
ValueError: multiclass format is not supported
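The error arises because roc_curve only accepts binary targets: correct.ravel() still holds the five class ids (and has one entry per image), while prediction_prob.ravel() has five entries per image. For micro-averaging, the labels are first turned into an indicator matrix with the same shape as prediction_prob, and both are then flattened. A minimal sketch, assuming synthetic labels and probabilities in place of the real ones:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(0)
num_class = 5

# Synthetic stand-ins for the asker's `correct` and `prediction_prob`.
correct = rng.integers(0, num_class, size=500)
raw = np.eye(num_class)[correct] + rng.uniform(0.0, 1.0, size=(500, num_class))
prediction_prob = raw / raw.sum(axis=1, keepdims=True)  # rows sum to 1

# Indicator matrix with shape (n_samples, num_class), matching prediction_prob.
correct_bin = label_binarize(correct, classes=np.arange(num_class))

# Flatten both so every (sample, class) pair becomes one binary decision.
fpr_micro, tpr_micro, _ = roc_curve(correct_bin.ravel(), prediction_prob.ravel())
roc_auc_micro = auc(fpr_micro, tpr_micro)
print(f"micro-average AUC: {roc_auc_micro:.2f}")
```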

Value Error when trying to visualize an image

I'm trying to visualize some images belonging to different classes. The classes are class0, class1 and class2, which correspond to X-ray pictures of healthy, covid and pneumonia lungs respectively. As an example, see picture below of a covid lung:
I've created three datasets containing the training, test and validation data. Please see the code below:
import pandas as pd
from keras_preprocessing.image import ImageDataGenerator
from matplotlib import pyplot as plt
import numpy as np
#Creating three dataframes reading .txt files
trainingfile = pd.read_table('data/training.txt', delim_whitespace=True, names=('class', 'image'))
testingfile = pd.read_table('data/testing.txt', delim_whitespace=True, names=('class', 'image'))
validationfile = pd.read_table('data/validation.txt', delim_whitespace=True, names=('class', 'image'))
#Change 0,1,2 to categorical class class0,class1,class2
trainingfile = trainingfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
testingfile = testingfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
validationfile = validationfile.replace([0, 1, 2], ['class0', 'class1', 'class2'])
#Final training, test and validation data
datagen=ImageDataGenerator(rescale=None)
train_generator=datagen.flow_from_dataframe(dataframe=trainingfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=32)
test_generator=datagen.flow_from_dataframe(dataframe=testingfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=15)
validation_generator=datagen.flow_from_dataframe(dataframe=validationfile, directory="data/", x_col="image", y_col="class", class_mode="categorical", target_size=(256,256), batch_size=21)
Now, the code to visualize one picture:
first_image = train_generator[0]
first_image = np.array(first_image, dtype='float')
pixels = first_image.reshape((28, 28))
plt.imshow(pixels, cmap='gray')
plt.show()
I get the following error:
ValueError Traceback (most recent call last)
<ipython-input-3-b237e88f96dd> in <module>
1 first_image = train_generator[0]
----> 2 first_image = np.array(first_image, dtype='float')
3 pixels = first_image.reshape((28, 28))
4 plt.imshow(pixels, cmap='gray')
5 plt.show()
ValueError: could not broadcast input array from shape (32,256,256,3) into shape (32)
Furthermore, is there any way to visualize an image corresponding to a specific class?
If, instead of first_image = train_generator[0], I use first_image = train_generator[0][0], the error becomes:
ValueError Traceback (most recent call last)
<ipython-input-4-0664c7dc8c6b> in <module>
1 first_image = train_generator[0][0]
2 first_image = np.array(first_image, dtype='float')
----> 3 pixels = first_image.reshape((28, 28))
4 plt.imshow(pixels, cmap='gray')
5 plt.show()
ValueError: cannot reshape array of size 6291456 into shape (28,28)
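Both errors are consistent with train_generator[0] being a whole batch: a tuple of an image array of shape (32, 256, 256, 3) and a one-hot label array, so train_generator[0][0] is still 32 images (32 * 256 * 256 * 3 = 6291456 values), not one 28x28 picture. A single image is batch_images[k], and its class is the argmax of the matching label row. The sketch below imitates such a batch with random pixels (an assumption) and picks the first image of a given class; first_image_of_class is a hypothetical helper. With the real generator, train_generator.class_indices should give the mapping from class name to index.

```python
import numpy as np

# Fake batch laid out like train_generator[0] (assumption: random pixels).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(32, 256, 256, 3), dtype=np.uint8)
labels = np.eye(3)[np.arange(32) % 3]  # one-hot rows, as with class_mode="categorical"
batch = (images, labels)


def first_image_of_class(batch, class_index):
    """Return the first image whose one-hot label equals class_index, or None."""
    batch_images, batch_labels = batch
    matches = np.flatnonzero(batch_labels.argmax(axis=1) == class_index)
    return batch_images[matches[0]] if matches.size else None


image = first_image_of_class(batch, class_index=1)
print(images.size)   # total values in the batch
print(image.shape)   # one RGB image

# To display it: plt.imshow(image) followed by plt.show(); no reshape is needed.
```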

Tensorflow embedding for categorical feature

In machine learning, it is common to represent a categorical (specifically: nominal) feature with one-hot-encoding. I am trying to learn how to use tensorflow's embedding layer to represent a categorical feature in a classification problem. I have got tensorflow version 1.01 installed and I am using Python 3.6.
I am aware of the tensorflow tutorial for word2vec, but it is not very instructive for my case. While building the tf.Graph, it uses NCE-specific weights and tf.nn.nce_loss.
I just want a simple feed-forward net as below, and the input layer to be an embedding. My attempt is below. It complains when I try to matrix multiply the embedding with the hidden layer due to shape incompatibility. Any ideas how I can fix this?
from __future__ import print_function
import pandas as pd
import tensorflow as tf
import numpy as np
from sklearn.preprocessing import LabelEncoder
if __name__ == '__main__':
    # 1 categorical input feature and a binary output
    df = pd.DataFrame({'cat2': np.array(['o', 'm', 'm', 'c', 'c', 'c', 'o', 'm', 'm', 'm']),
                       'label': np.array([0, 0, 1, 1, 0, 0, 1, 0, 1, 1])})
    encoder = LabelEncoder()
    encoder.fit(df.cat2.values)
    X = encoder.transform(df.cat2.values)
    Y = np.zeros((len(df), 2))
    Y[np.arange(len(df)), df.label.values] = 1
    # Neural net parameters
    training_epochs = 5
    learning_rate = 1e-3
    cardinality = len(np.unique(X))
    embedding_size = 2
    input_X_size = 1
    n_labels = len(np.unique(Y))
    n_hidden = 10
    # Placeholders for input, output
    x = tf.placeholder(tf.int32, [None, 1], name="input_x")
    y = tf.placeholder(tf.float32, [None, 2], name="input_y")
    # Neural network weights
    embeddings = tf.Variable(tf.random_uniform([cardinality, embedding_size], -1.0, 1.0))
    h = tf.get_variable(name='h2', shape=[embedding_size, n_hidden],
                        initializer=tf.contrib.layers.xavier_initializer())
    W_out = tf.get_variable(name='out_w', shape=[n_hidden, n_labels],
                            initializer=tf.contrib.layers.xavier_initializer())
    # Neural network operations
    embedded_chars = tf.nn.embedding_lookup(embeddings, x)
    layer_1 = tf.matmul(embedded_chars, h)
    layer_1 = tf.nn.relu(layer_1)
    out_layer = tf.matmul(layer_1, W_out)
    # Define loss and optimizer
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=out_layer, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
    # Initializing the variables
    init = tf.global_variables_initializer()
    # Launch the graph
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(training_epochs):
            avg_cost = 0.
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost],
                            feed_dict={x: X, y: Y})
    print("Optimization Finished!")
EDIT:
Please see below the error message:
Traceback (most recent call last):
File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/common_shapes.py", line 671, in _call_cpp_shape_fn_impl
input_tensors_as_shapes, status)
File "/home/anaconda3/lib/python3.6/contextlib.py", line 89, in __exit__
next(self.gen)
File "/home/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 2 but is rank 3 for 'MatMul' (op: 'MatMul') with input shapes: [?,1,2], [2,10].
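Just make your x placeholder shape [None] instead of [None, 1]. tf.nn.embedding_lookup appends the embedding dimension to the shape of the ids, so ids of shape [None, 1] produce a [None, 1, 2] tensor (rank 3), which tf.matmul cannot multiply by the [2, 10] matrix h. With ids of shape [None], the lookup returns a [None, 2] tensor and the matmul works.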
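The shape behaviour can be checked with plain NumPy, whose integer-array indexing follows the same rule as an embedding lookup (output shape = index shape + embedding size). This is only a shape illustration under that assumption, not TensorFlow code; the variable names mirror the question.

```python
import numpy as np

cardinality, embedding_size, n_hidden = 3, 2, 10
rng = np.random.default_rng(0)
embeddings = rng.uniform(-1.0, 1.0, size=(cardinality, embedding_size))
h = rng.normal(size=(embedding_size, n_hidden))

# Label-encoded categories from the question: c -> 0, m -> 1, o -> 2.
X = np.array([2, 1, 1, 0, 0, 0, 2, 1, 1, 1])

# Ids of shape (10, 1) -> lookup of shape (10, 1, 2): rank 3, as in the error.
rank3 = embeddings[X.reshape(-1, 1)]

# Ids of shape (10,) -> lookup of shape (10, 2): rank 2, ready for h.
rank2 = embeddings[X]
layer_1 = np.maximum(rank2 @ h, 0.0)  # matrix product followed by ReLU

print(rank3.shape, rank2.shape, layer_1.shape)  # (10, 1, 2) (10, 2) (10, 10)
```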

Scikit Learn: RandomForest: clf.predict works with float, but not clf.score

I'm working on a classification problem. The labels I am trying to predict:
df3['relevance'].unique()
array([ 3. , 2.5 , 2.33, 2.67, 2. , 1. , 1.67, 1.33, 1.25,
2.75, 1.75, 1.5 , 2.25])
When I call predict using the features I've made, it works OK:
clf = RandomForestClassifier()
clf.fit(df3[features], df3['relevance'])
pd.crosstab(clf.predict(df3[features]), df3['relevance'])
But when I call clf.score:
clf.score(df3[features], df3['relevance'])
I get
ValueError: continuous is not supported
Should I be classifying the relevance label I am trying to predict as another data type? Thanks for any help.
The issue you are facing is likely because your relevance column is made up of continuous numbers.
I would suggest switching over to the RandomForestRegressor() if you are trying to predict continuous numbers. Otherwise, convert your variables into 1s and 0s based on some threshold value.
Simply encode labels as integers and everything will work well. Floats suggest regression.
In particular you can use LabelEncoder http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
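As a rough sketch of the LabelEncoder suggestion, assuming made-up features and a handful of relevance levels in place of df3: the float labels are mapped to integer class ids before fitting, and predictions are mapped back with inverse_transform.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)

# Made-up data (assumption): 4 numeric features and float relevance labels.
levels = np.array([1.0, 1.33, 1.67, 2.0, 2.33, 2.67, 3.0])
X = rng.normal(size=(200, 4))
relevance = rng.choice(levels, size=200)

# Map each distinct float label to an integer class id.
encoder = LabelEncoder()
y = encoder.fit_transform(relevance)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
score = clf.score(X, y)  # accuracy on the training data
print(score)

# Map predicted class ids back to the original relevance values.
predicted_relevance = encoder.inverse_transform(clf.predict(X))
```

Encoding this way treats each relevance value as an unordered category; if the ordering of the values matters, a regressor may be the better fit, as suggested above.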
>>> from sklearn.ensemble import RandomForestClassifier as RF
>>> import numpy as np
>>> X = np.array([[0], [1], [1.2]])
>>> y = [0.5, 1.2, -0.1]
>>> clf = RF()
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> print(clf.score(X, y))
Traceback (most recent call last):
[.....]
ValueError: continuous is not supported
>>> y = [0, 1, 2]
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=None, verbose=0,
warm_start=False)
>>> print(clf.score(X, y))
1.0
Or compute the score yourself, since it is a trivial function:
print(np.mean(clf.predict(X) == y))