How to change xtick of Yellowbrick's Learning Curve visualizer? - matplotlib

I'm trying to change the xticks of Yellowbrick's learning curve figure from the number of samples to the normalized number (%) of samples. I googled a lot but couldn't find a way to do it.

You need to change the xticks so that they are normalized to the number of training instances. To do that you specify the total number of training instances (55000 in my example) in matplotlib's PercentFormatter. I provide the before and after images.
from yellowbrick.model_selection import LearningCurve
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from yellowbrick.datasets import load_game
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

# Create the figure and axes the visualizer will draw on
fig, ax = plt.subplots()

# Training-set sizes for the learning curve
sizes = np.linspace(0.3, 1.0, 10)

# Load a classification dataset
X, y = load_game()

# Encode the categorical data
X = OneHotEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)

# Instantiate the classification model and visualizer
model = MultinomialNB()
visualizer = LearningCurve(
    model, scoring='f1_weighted', ax=ax, train_sizes=sizes)

# Show x ticks as a percentage of the 55000 training instances
xticks = mtick.PercentFormatter(55000)
ax.xaxis.set_major_formatter(xticks)

visualizer.fit(X, y)  # Fit the data to the visualizer
visualizer.show()
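The key point is how PercentFormatter scales the ticks: PercentFormatter(xmax) renders a tick value x as 100 * x / xmax percent, so passing the total number of training instances makes the axis run from 0% to 100%. A standalone sketch of just that behavior, with dummy data in place of the learning curve:
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick

fig, ax = plt.subplots()
ax.plot([0, 55000], [0, 1])  # dummy data spanning the full sample range
# A tick at x is drawn as 100 * x / 55000 percent, e.g. 27500 -> 50%
ax.xaxis.set_major_formatter(mtick.PercentFormatter(xmax=55000))
plt.show()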

Related

Linear regression with one feature from Pandas dataframe

I have tried the code below
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
# Assign the dataframe to this variable.
# TODO: Load the data
bmi_life_data = pd.read_csv("bmi_and_life_expectancy.csv")
X = bmi_life_data['BMI'].values.reshape(-1, 1)
y = bmi_life_data['Life expectancy'].values.reshape(-1,1)
# Make and fit the linear regression model
#TODO: Fit the model and Assign it to bmi_life_model
bmi_life_model = LinearRegression()
bmi_life_model.fit(X,y)
# Make a prediction using the model
# TODO: Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict(21.07931)
but it gives me the error
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Even after reshaping it, I still get the same error. I have also tried not reshaping it, but it still gives me the same error.
The error is in the prediction line
laos_life_exp = bmi_life_model.predict(21.07931)
It should be
laos_life_exp = bmi_life_model.predict([[21.07931]])
so that the input has the appropriate dimensions.
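The reason is that scikit-learn estimators expect a 2-D input of shape (n_samples, n_features), so a single scalar has to be wrapped twice. A quick check of the shape:
import numpy as np

# [[21.07931]] is one sample with one feature
sample = np.array([[21.07931]])
print(sample.shape)  # (1, 1)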
Thanks to @onyambu

How to include or deleat completely the figure axis using Sklearn ConfusionMatrixDisplay?

I am using the below code to generate a confusion matrix using the Sklearn library. But when saving the image, the y-axis label, i.e. 'True label', is not printed completely; it appears cut off. In the Python console it is printed okay, but I need a high-resolution image and hence I need to save it. Also, the publisher wants only TIFF or PDF format.
disp = ConfusionMatrixDisplay(confusion_matrix=cm1,
                              display_labels=['anger', 'bordome', 'disgust', 'fear',
                                              'happiness', 'sadness', 'neutral'])
font = {'size': '30'}
plt.rc('font', **font)
plt.rcParams['figure.figsize'] = [20, 20]
disp.plot(cmap='Blues', values_format='0.2f')
plt.xticks(rotation=45)
plt.savefig("Fig.5.tif", dpi=30)
plt.show()
Also, can I remove both axis labels somehow? That would also solve my problem.
Thanks
The picture is a matplotlib plot, so to remove the ticks for each axis and their labels you can use set_ticks([]), which removes both. I am using the sample from the scikit-learn documentation to create a confusion matrix. plt.gca() gets the current axes of the plot, and setting set_ticks([]) to an empty list removes the ticks and labels. Of course bbox_inches='tight' will be required, as mentioned by @endive1783. The code is here. I have made the plot a little smaller than your 20x20 figure, but it should work for any size.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = SVC(random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=clf.classes_)
font = {'size': '15'}
plt.rc('font', **font)
plt.rcParams['figure.figsize'] = [6, 6]
disp.plot(cmap='Blues', values_format='0.2f')
# plt.xticks(rotation=45)  # not needed once the ticks are removed

# Setting empty tick lists removes both the ticks and their labels
plt.gca().axes.get_xaxis().set_ticks([])
plt.gca().axes.get_yaxis().set_ticks([])
plt.savefig("Fig.5.tif", dpi=30, bbox_inches='tight')
plt.show()
Update: keep the tick labels, but remove the axis labels
Replace the two set_ticks() lines with the two below.
plt.gca().xaxis.label.set_visible(False)
plt.gca().yaxis.label.set_visible(False)
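Equivalently (a small sketch, assuming disp.plot() has already been called), you can go through the ax_ attribute that ConfusionMatrixDisplay exposes:
# ConfusionMatrixDisplay stores its matplotlib Axes in disp.ax_ after plot()
disp.ax_.set_xlabel('')
disp.ax_.set_ylabel('')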

How to get samples per class for TensorFlow Dataset

I am using a dataset from TensorFlow Datasets.
Is there an easy way to access the number of samples for each class in a dataset? I searched through the Keras API and did not find any ready-to-use function.
Ultimately I would like to plot a bar plot with the number of samples on the Y axis and an int indicating the class id on the X axis. The goal is to show how evenly the data is distributed across the classes.
With np.fromiter you can create a 1-D array from an iterable object.
import tensorflow_datasets as tfds
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

dataset = tfds.load('cifar10', split='train', as_supervised=True)

# Build a 1-D array of labels and count each class
labels, counts = np.unique(np.fromiter(dataset.map(lambda x, y: y), np.int32),
                           return_counts=True)

sns.barplot(x=labels, y=counts)
plt.xlabel('Labels')
plt.ylabel('Counts')
Update: You can also count the labels like below:
labels = []
for x, y in dataset:
    # Not one hot encoded
    labels.append(y.numpy())
    # If one hot encoded, then apply argmax
    # labels.append(np.argmax(y, axis=-1))
labels = np.concatenate(labels, axis=0)  # Assuming dataset was batched.
Then you can plot them using the labels array.
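For completeness, a minimal sketch of that plotting step, assuming labels is the array built above:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Count how often each class id occurs in the collected labels
unique_labels, counts = np.unique(labels, return_counts=True)
sns.barplot(x=unique_labels, y=counts)
plt.xlabel('Labels')
plt.ylabel('Counts')
plt.show()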

Activation function with exponential distributed data

I am just beginning with neural networks. I have a bunch of targets to predict in a regression model. What I have noticed is that the model works perfectly when the targets are normally distributed, but it does not work well with exponentially distributed targets. I understand this relates to the activation function, but I have been trying many functions (relu, linear, selu, elu, etc.) and still haven't gotten a great result.
Please check the images below.
[Figure: histogram of a normally distributed target]
[Figure: histogram of an exponentially distributed target]
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data. It does not; it is far more general. OLS regression makes no assumptions about the data, it makes assumptions about the errors, as estimated by the residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. There are a few ways to deal with skewed data sets:
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn import preprocessing
from sklearn.datasets import load_iris

# Load the iris dataset into a dataframe
iris = load_iris()
print(iris.data.shape)
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)

# Separate the data from the target attributes
X = df[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = df['species']
sns.displot(X)

# Normalize the data attributes for the Iris dataset
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)

# Standardize the data attributes for the Iris dataset
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

X_train, X_test, y_train, y_test = train_test_split(
    df[iris.feature_names], iris.target, test_size=0.5,
    stratify=iris.target, random_state=123456)

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)

predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try applying a different model to your specific data set. Support Vector Machine algorithms just care about the boundaries of the separating hyperplane and do not assume the exact shape of the distributions. One of my favorite models is in the Decision Tree family: specifically, the Random Forest model.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/

How do I plot a non-linear model using matplotlib?

I'm a bit lost as to how to proceed to achieve this. Normally with a linear model, when I perform linear regressions, I simply take my training data (x) and my output data (y) and plot them using matplotlib. Now I have 3 features and my output/observation (y). Can anyone guide me as to how to graph this kind of model using matplotlib? My goal is to fit a polynomial model and graph the polynomial using matplotlib.
%matplotlib inline
import sframe as frame
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Initialize SFrame
sales = frame.SFrame('kc_house_data.gl/')

# Separate data into test and training data
train_data, test_data = sales.random_split(.8, seed=0)

# Organize data into training and testing data
train_x = train_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
train_y = train_data[['price']].to_dataframe().values
test_x = test_data[['sqft_living', 'bedrooms', 'bathrooms']].to_dataframe().values
test_y = test_data[['price']].to_dataframe().values

# Create a model using sklearn with multiple features
regr = linear_model.LinearRegression(fit_intercept=True, n_jobs=2)
regr.fit(train_x, train_y)  # fit the model before predicting

# test predictions
regr.predict(train_x)

# Prepare to plot the data
Note:
The train_x variable contains my 3 features, and train_y contains the output data. I use SFrame to hold the data. SFrame can convert itself into a dataframe (as used in Pandas), and using that conversion I am able to grab the values.
Rather than plotting a non-linear model with multiple discrete features at once, I have found that simply observing each feature against my observation/output was better and easier for my research.
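A minimal sketch of that feature-by-feature view, assuming train_x and train_y as built above and the three feature names from the question:
import matplotlib.pyplot as plt

# One scatter plot per feature against the observed price
feature_names = ['sqft_living', 'bedrooms', 'bathrooms']
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, (ax, name) in enumerate(zip(axes, feature_names)):
    ax.scatter(train_x[:, i], train_y, s=5, alpha=0.3)
    ax.set_xlabel(name)
    ax.set_ylabel('price')
plt.tight_layout()
plt.show()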