Can't visualize plotted Confusion Matrix - pandas

I am new to ML and learning the fundamentals.
I am working on the Dog-vision dataset (https://www.kaggle.com/c/dog-breed-identification) and trying to plot a confusion matrix, but I can't figure out what I am doing wrong. I need help!
My true_label looks like this
true_label[:10]
array([26, 96, 8, 15, 3, 10, 62, 82, 92, 16])
And predicted_label looks like this
predicted_l[:10]
array([26, 96, 8, 15, 3, 10, 62, 82, 92, 16])
The two arrays are mostly the same, but not every element matches.
Then I converted them into a pandas DataFrame with code like this:
import pandas as pd
from sklearn.metrics import confusion_matrix
classes = []
for i in range(0, 99):
    classes.append(i)
cf_matrix = confusion_matrix(true_l, predicted_l)
cf_matrix_df = pd.DataFrame(cf_matrix, index=classes,columns=classes)
cf_matrix_df
And then the output looks like this:
Then I tried to plot the confusion matrix from this DataFrame, but it is not plotted correctly. Here are the code and the output of my confusion matrix:
import matplotlib.pyplot as plt
import seaborn as sns
figure = plt.figure(figsize=(8, 8))
sns.heatmap(cf_matrix_df, annot=True,cmap=plt.cm.Blues)
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
Output
If you need more information, please have a look at my notebook here.
https://colab.research.google.com/drive/1SoXJJNTnGx39uZHizAut-HuMtKhQQolk?usp=sharing

You can improve the plot by removing the annot=True argument, which writes the data value into every cell. With this many classes the numbers overlap and hide the colors, so dropping the argument gives a much more readable heatmap:
sns.heatmap(cf_matrix_df, cmap=plt.cm.Blues)
UPDATE: Increasing the figure size (the figsize argument of plt.figure) will also make the visualization clearer.
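Here is a minimal sketch combining both suggestions (no annot=True, larger figure). The labels are synthetic stand-ins, since the real true_l and predicted_l are not reproduced here; the sample size, the roughly 90% accuracy, and the figure size are assumptions chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical stand-in data: 99 classes, roughly 90% of predictions correct.
rng = np.random.default_rng(0)
n_classes = 99
n_samples = 2000
true_l = rng.integers(0, n_classes, size=n_samples)
random_guess = rng.integers(0, n_classes, size=n_samples)
predicted_l = np.where(rng.random(n_samples) < 0.9, true_l, random_guess)

# Count (true, predicted) pairs: rows are true labels, columns are predictions.
cf_matrix = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cf_matrix, (true_l, predicted_l), 1)

classes = list(range(n_classes))
cf_matrix_df = pd.DataFrame(cf_matrix, index=classes, columns=classes)

# Larger figure and no per-cell annotations, so the colors stay readable.
fig, ax = plt.subplots(figsize=(16, 14))
sns.heatmap(cf_matrix_df, cmap="Blues", ax=ax)
ax.set_ylabel("True label")
ax.set_xlabel("Predicted label")
fig.tight_layout()
```

In a notebook, finish with plt.show(). A mostly correct classifier should show up as a dark diagonal with a few scattered off-diagonal cells.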

Related

Matplotlib Syntax Issue

I'm trying to plot a simple graph using matplotlib, but nothing is working. I've followed the documentation, but it still spits out this error regardless of what I do.
plt.plot(Time, raw, colour="blue")
^SyntaxError: invalid syntax
Try the following:
import matplotlib.pyplot as plt
# set test data
time = [1, 2, 3]
raw = [3, 2, 1]
plt.plot(time, raw, color="blue")
Returns

Ticks position in heatmap with categorical data (seaborn)

I am trying to plot a confusion matrix of my predictions. My data is multi-class (13 different labels) so I'm using a heatmap.
As you can see below, my heatmap looks generally okay, but the labels are a bit out of position: the y ticks should be a little lower and the x ticks a bit more to the right. I want to move the ticks on both axes so that they are aligned with the center of each square.
my code:
sns.set()
my_mask = np.zeros((con_matrix.shape[0], con_matrix.shape[0]), dtype=int)
for i in range(con_matrix.shape[0]):
    for j in range(con_matrix.shape[0]):
        my_mask[i][j] = con_matrix[i][j] == 0
fig_dims = (10, 10)
plt.subplots(figsize=fig_dims)
ax = sns.heatmap(con_matrix, annot=True, fmt="d", linewidths=.5, cmap="Pastel1", cbar=False, mask=my_mask, vmax=15)
plt.xticks(range(len(party_names)), party_names, rotation=45)
plt.yticks(range(len(party_names)), party_names, rotation='horizontal')
plt.show()
and for reproduction purpose, here are con_matrix and party_names hard-coded:
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
con_matrix = np.array([[55, 0, 0, 0,0, 0, 0,0,0,0,0,0,2], [0,199,0,0,0,0,0,0,0,0,2,0,1],
[0, 0,52,0,0,0,0,0,0,0,0,0,1],
[0,0,0,39,0,0,0,0,0,0,0,0,0],
[0,0,0,0,90,0,0,0,0,0,0,4,3],
[0,0,0,1,0,35,0,0,0,0,0,0,0],
[0,0,0,0,5,0,26,0,0,1,0,1,0],
[0,5,0,0,0,1,0,44,0,0,3,0,1],
[0,1,0,0,0,0,0,0,52,0,0,0,0],
[0,1,0,0,2,0,0,0,0,235,0,1,1],
[1,2,0,0,0,0,0,3,0,0,34,0,3],
[0,0,0,0,5,0,0,0,0,1,0,40,0],
[0,0,0,0,0,0,0,0,0,1,0,0,46]])
party_names = ['Blues', 'Browns', 'Greens', 'Greys', 'Khakis', 'Oranges', 'Pinks', 'Purples', 'Reds', 'Turquoises', 'Violets', 'Whites', 'Yellows']
I already tried working with the position argument of the different axes, but it did not turn out well. I could not find an exact answer on this site either (at least not a solution that works for categorical data).
I'm new to visualization with seaborn, so any improvements with explanations would be appreciated (not only for my problem, but for my code and visualization as well).
You can shift the tick labels on both axes by an offset of 0.5 to get the desired alignment. To do so, I used NumPy's arange, which allows vectorized addition of 0.5 to the whole array.
plt.xticks(np.arange(len(party_names))+0.5, party_names, rotation=45)
plt.yticks(np.arange(len(party_names))+0.5, party_names, rotation='horizontal')
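To see why 0.5 is the right offset: in a seaborn heatmap, cell i spans the interval [i, i + 1] on each axis, so its center sits at i + 0.5. Below is a small sketch of the same idea on a 3x3 matrix. The counts are made up, and the three names are just the first entries of party_names.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Made-up 3x3 confusion matrix with the first three labels from the question.
names = ["Blues", "Browns", "Greens"]
mat = np.array([[5, 0, 1],
                [0, 7, 0],
                [2, 0, 4]])

fig, ax = plt.subplots(figsize=(4, 4))
sns.heatmap(mat, annot=True, fmt="d", cbar=False, ax=ax)

# Cell i covers [i, i + 1], so the cell centers are at i + 0.5.
centers = np.arange(len(names)) + 0.5
ax.set_xticks(centers)
ax.set_xticklabels(names, rotation=45)
ax.set_yticks(centers)
ax.set_yticklabels(names, rotation=0)
```

As an alternative, sns.heatmap also accepts lists through its xticklabels and yticklabels parameters, in which case the labels are placed at the cell centers without any manual offset.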

How to manually scale a continuous legend in a seaborn scatterplot?

I'm creating a scatterplot with seaborn like this:
plt.figure(figsize=(20,5))
ax = sns.scatterplot(x=x,
                     y=y,
                     hue=errors,
                     s=errors*20,
                     alpha=0.8,
                     edgecolors='w')
ax.set(xlabel='X', ylabel='Y')
ax.legend(title="Error (m)", loc='upper right')
My errors contain values between approximately 0.1 and 12.5. However, for my legend seaborn automatically generates labels 0, 5, 10, 15. This makes my algorithm look worse than it is. I would like to change the step size in the legend while maintaining a correct mapping between colors and error magnitudes. For example 0, 4, 8, 12.5. Is this possible?

How can I specify multiple variables for the hue parameters when plotting with seaborn?

When using seaborn, is there a way to include multiple variables (columns) for the hue parameter? Put another way, how can I group my data by multiple variables before plotting them on a single x, y plot?
I want to do something like the following, but currently I cannot specify two variables for the hue parameter:
sns.relplot(x='#', y='Attack', hue=['Legendary', 'Stage'], data=df)
For example, assume I have a pandas DataFrame like the one below, containing a Pokemon database obtained via this tutorial.
I want to plot the Pokedex # on the x-axis and Attack on the y-axis, with the data grouped by both Stage and Legendary. Using matplotlib, I wrote a custom function that groups the DataFrame by ['Legendary','Stage'] and then iterates through each group for plotting (see results below). Although my custom function works as intended, I was hoping this could be achieved more simply with seaborn. I assume other people have also tried to visualize more than three variables in a single plot using seaborn?
fig, ax = plt.subplots()
grouping_variables = ['Stage','Legendary']
group_1 = df.groupby(grouping_variables)
for group_1_label, group_1_df in group_1:
    ax.scatter(group_1_df['#'], group_1_df['Attack'], label=group_1_label)
ax_legend = ax.legend(title=grouping_variables)
Edit 1:
Note: In the example I provided, I grouped the data by only two variables (Legendary and Stage). However, other situations may require an arbitrary number of variables (e.g. 5 variables).
You can leverage the fact that hue accepts either a column name, or a sequence of the same length as your data, listing the color categories to assign each data point to. So...
sns.relplot(x='#', y='Attack', hue='Stage', data=df)
... is basically the same as:
sns.relplot(x='#', y='Attack', hue=df['Stage'], data=df)
You typically wouldn't use the latter, it's just more typing to achieve the same thing -- unless you want to construct a custom sequence on the fly:
sns.relplot(x='#', y='Attack', data=df,
            hue=df[['Legendary', 'Stage']].apply(tuple, axis=1))
How you build the sequence that you pass via hue is entirely up to you. The only requirements are that it has the same length as your data and, if it is array-like, that it is one-dimensional. So you can't just pass hue=df[['Legendary', 'Stage']]; you have to combine the columns into one somehow. I chose tuple as the simplest and most versatile way, but if you want more control over the formatting, build a Series of strings. I'll save it in a separate variable here for readability and so that I can assign it a name (which will be used as the legend title), but you don't have to:
hue = df[['Legendary', 'Stage']].apply(
    lambda row: f"{row.Legendary}, {row.Stage}", axis=1)
hue.name = 'Legendary, Stage'
sns.relplot(x='#', y='Attack', hue=hue, data=df)
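Since the question's edit asks about an arbitrary number of grouping variables, the same idea can be wrapped in a small helper. This is only a sketch: combined_hue is a hypothetical name, and the four-row DataFrame is a made-up miniature of the Pokemon data.

```python
import pandas as pd

# Hypothetical miniature of the Pokemon DataFrame from the question.
df = pd.DataFrame({
    "#": [1, 2, 3, 4],
    "Attack": [49, 62, 82, 110],
    "Stage": [1, 2, 3, 1],
    "Legendary": [False, False, False, True],
})


def combined_hue(frame, columns, sep=", "):
    """Join the given columns row-wise into one string Series usable as hue."""
    hue = frame[columns].astype(str).apply(sep.join, axis=1)
    hue.name = sep.join(columns)  # the Series name becomes the legend title
    return hue


hue = combined_hue(df, ["Legendary", "Stage"])
print(hue.tolist())  # ['False, 1', 'False, 2', 'False, 3', 'True, 1']

# The result can then be passed to seaborn as in the answer above:
# sns.relplot(x='#', y='Attack', hue=hue, data=df)
```

Because the helper takes a list of column names, grouping by five variables only means passing a longer list.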
To use the hue parameter of seaborn.relplot, consider concatenating the needed groups into a single column and then plotting with that new column:
def run_plot(df, flds):
    # CREATE NEW COLUMN OF CONCATENATED VALUES
    df['_'.join(flds)] = pd.Series(df.reindex(flds, axis='columns')
                                     .astype('str')
                                     .values.tolist()
                                   ).str.join('_')
    # PLOT WITH hue (use the df argument rather than a global DataFrame)
    sns.relplot(x='#', y='Attack', hue='_'.join(flds), data=df, aspect=1.5)
    plt.show()
    plt.clf()
    plt.close()
To demonstrate with random data
Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
### DATA
np.random.seed(22320)
random_df = pd.DataFrame({'#': np.arange(1,501),
                          'Name': np.random.choice(['Bulbasaur', 'Ivysaur', 'Venusaur',
                                                    'Charmander', 'Charmeleon'], 500),
                          'HP': np.random.randint(1, 100, 500),
                          'Attack': np.random.randint(1, 100, 500),
                          'Defense': np.random.randint(1, 100, 500),
                          'Sp. Atk': np.random.randint(1, 100, 500),
                          'Sp. Def': np.random.randint(1, 100, 500),
                          'Speed': np.random.randint(1, 100, 500),
                          'Stage': np.random.randint(1, 3, 500),
                          'Legend': np.random.choice([True, False], 500)
                          })
Plots
run_plot(random_df, ['Legend', 'Stage'])
run_plot(random_df, ['Legend', 'Stage', 'Name'])
In seaborn's scatterplot(), you can combine the hue= and style= parameters to get a different color and a different marker for each combination.
Example (taken verbatim from the documentation):
tips = sns.load_dataset("tips")
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
ax = sns.scatterplot(x="total_bill", y="tip",
                     hue="day", style="time", data=tips)

imshow inverts colors after using astype(float)

I'm using matplotlib imshow to visualize data from cifar-10. After reading in cifar10 data I've noticed that the image rendered from imshow is different after I use .astype(float).
For example,
Without .astype(float)
Here's what I see with .astype(float)
Why does it look like the image is rendering with the colors inverted?
Here is the code I am using:
dir = 'resources/datasets/cifar-10-batches-py'
import cPickle
import matplotlib.pyplot as plt
fo = open(dir + '/data_batch_1', 'rb')
dict = cPickle.load(fo)
fo.close()
X=dict['data'].reshape((10000, 3, 32, 32)).transpose(0, 2, 3, 1).astype(float)
Y=dict['labels']
plt.imshow(X[2,])
plt.show()
A little late, but for people seeking help and finding this post:
A good explanation of how matplotlib decides on the colormap is given in this post.
I just faced the same problem and found a solution in this post. It seems that matplotlib assumes 3D (RGB) images of type float to be in the range [0, 1] and wraps out-of-range values with a modulo-1 operation, e.g. 2.7 -> 0.7.
Try this:
X=dict['data'].reshape((10000, 3, 32, 32)).transpose(0, 2, 3,1)
X = X.astype(float) / 255
plt.imshow(X[2,])
Now matplotlib should use the provided values directly, with no internal modulo magic.
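The rescaling step can be checked without the dataset. The sketch below assumes a random uint8 array as a stand-in for dict['data'] (10 images instead of 10000, so raw and its contents are hypothetical) and confirms that dividing by 255 puts every value into the [0, 1] range that imshow expects for float RGB data.

```python
import numpy as np

# Hypothetical stand-in for dict['data']: 10 flattened 32x32 RGB images, uint8.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(10, 3072), dtype=np.uint8)

# Same reshape/transpose as above: (N, 3, 32, 32) -> (N, 32, 32, 3).
X = raw.reshape((10, 3, 32, 32)).transpose(0, 2, 3, 1)

# Float images must be scaled into [0, 1] before being passed to imshow.
X_float = X.astype(float) / 255

print(X.dtype, X.min(), X.max())
print(X_float.dtype, X_float.min(), X_float.max())
```

Leaving the array as uint8 (that is, skipping astype(float) altogether) is the other option, since imshow interprets integer RGB data on a 0 to 255 scale.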