Handwritten Digit Recognition on MNIST dataset using sklearn - numpy

I want to build a handwritten digit recognizer on the MNIST dataset using sklearn, and I wanted to shuffle my training set for both the features (x) and the labels (y). But it throws a KeyError. Let me know the correct way to do it.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
x,y=mnist['data'],mnist['target']
x.shape
y.shape
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
digit = np.array(x.iloc[45])
digit_img = digit.reshape(28,28)
plt.imshow(digit_img, cmap=matplotlib.cm.binary, interpolation="nearest")
plt.axis("off")
y.iloc[45]
x_train, x_test = x[:60000],x[60000:]
y_train, y_test=y[:60000],y[60000:]
import numpy as np
shuffled = np.random.permutation(60000)
x_train = x_train[shuffled]  # <-- these two lines
y_train = y_train[shuffled]  # <-- are throwing the error

Check whether type(x_train) is numpy.ndarray or a pandas DataFrame.
Since Scikit-Learn 0.24, fetch_openml() returns a pandas DataFrame by default.
If it is a DataFrame, you cannot use x_train[shuffled]: that positional style of indexing is meant for arrays, and on a DataFrame it is interpreted as a column-label lookup, which is why you get a KeyError.
Use x_train.iloc[shuffled] instead.
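For example, a minimal sketch of both options (as_frame=False is the documented way to get NumPy arrays back directly):

import numpy as np
from sklearn.datasets import fetch_openml

# Option 1: keep the DataFrame/Series and index positionally with .iloc
mnist = fetch_openml('mnist_784')
x, y = mnist['data'], mnist['target']
x_train, y_train = x[:60000], y[:60000]
shuffled = np.random.permutation(60000)
x_train = x_train.iloc[shuffled]
y_train = y_train.iloc[shuffled]

# Option 2: ask for NumPy arrays up front, then the original
# array-style indexing works unchanged
mnist = fetch_openml('mnist_784', as_frame=False)
x, y = mnist['data'], mnist['target']
x_train, y_train = x[:60000], y[:60000]
x_train = x_train[shuffled]
y_train = y_train[shuffled]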

Related

How to change xtick of Yellowbrick's Learning Curve visualizer?

I'm trying to change the xticks of Yellowbrick's learning curve figure from the number of samples to the normalized number (%) of samples. I googled a lot but couldn't find a way.
You need to change the xticks so that they are normalized to the number of training instances, so you pass the total number of training instances (55000 in my example) to the PercentFormatter. I provide the before and after images.
from yellowbrick.model_selection import LearningCurve
from sklearn.naive_bayes import MultinomialNB
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from yellowbrick.datasets import load_game
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick  # needed for PercentFormatter below
# Create subplot
fig,ax = plt.subplots()
# Create the learning curve visualizer
sizes = np.linspace(0.3, 1.0, 10)
# Load a classification dataset
X, y = load_game()
# Encode the categorical data
X = OneHotEncoder().fit_transform(X)
y = LabelEncoder().fit_transform(y)
# Instantiate the classification model and visualizer
model = MultinomialNB()
visualizer = LearningCurve(
    model, scoring='f1_weighted', ax=ax, train_sizes=sizes)
xticks = mtick.PercentFormatter(55000)
ax.xaxis.set_major_formatter(xticks)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.show()

Cannot get reproducible results with ImageDataGenerator in keras

I am trying to get reproducible results between multiple runs of the same script in keras, but I get different ones at each iteration. My code looks like this:
import numpy as np
from numpy.random import seed
import random as rn
import os
seed_num = 1
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
os.environ['PYTHONHASHSEED'] = '1'
os.environ['TF_DETERMINISTIC_OPS'] = '1'
np.random.seed(seed_num)
rn.seed(seed_num)
import tensorflow as tf
tf.random.set_seed(seed_num)
import tensorflow.keras as ks
from tensorflow.python.keras import backend as K
...some imports...
from tensorflow.keras.preprocessing.image import ImageDataGenerator
.... data loading etc ....
generator = ImageDataGenerator(
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True)
generator.fit(X_train, seed=seed_num)
my_model.fit(generator.flow(X_train, y_train, batch_size=batch_size, shuffle=False, seed=seed_num),
             validation_data=(X_val, y_val), callbacks=callbacks, epochs=epochs, shuffle=False)
I identified the problem to be in ImageDataGenerator: when I set generator = ImageDataGenerator() without any augmentation, the results are reproducible. I am running on the CPU, and my TensorFlow version is 2.4.1. What am I missing here?
Using GPU while creating augmented images can produce nondeterministic results.
To get reproducible results using ImageDataGenerator and GPU, one way is the following:
import random, os
import numpy as np
import tensorflow as tf
def set_seed(seed=0):
    np.random.seed(seed)
    tf.random.set_seed(seed)
    random.seed(seed)
    os.environ['TF_DETERMINISTIC_OPS'] = "1"
    os.environ['TF_CUDNN_DETERMINISTIC'] = "1"
    os.environ['PYTHONHASHSEED'] = str(seed)
set_seed()
Then, right before the model.fit() call, invoke set_seed() again:
set_seed()
model.fit(...)
Alternatively, you can install the tensorflow-determinism package:
pip install tensorflow-determinism
If you're using Google Colab, restart your runtime afterwards or it probably won't work.
The package interacts with the GPU to produce deterministic results.
import random, os
import numpy as np
import tensorflow as tf
def set_seed(seed=0):
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
set_seed()
# code
In this case too, call set_seed() again before model.fit():
set_seed()
model.fit(...)

changing colorful dataset to Grayscale using keras [duplicate]

This question already has answers here:
How can I convert an RGB image into grayscale in Python?
I loaded a dataset from the astroNN library. Since I believe the color of the images is not a factor in classifying galaxy formations, I want to convert the whole dataset to grayscale to reduce the size of the images. How should I do this for the entire dataset?
Here's the part of my code which loads the dataset and splits it:
import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow as tf
import tensorflow.keras.layers as tfl
from astroNN.datasets import load_galaxy10
from tensorflow.keras import utils
import tensorflow_datasets as tfds
from tensorflow.keras.utils import to_categorical
import h5py
from matplotlib.pyplot import imread
import scipy
import pandas as pd
import math
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
import lodgepole.image_tools as lit
# To load images and labels (will download automatically at the first time)
# First time downloading location will be ~/.astroNN/datasets/
images, labels = load_galaxy10()
# To convert the labels to categorical 10 classes
labels = utils.to_categorical(labels, 10)
# To convert to desirable type
labels = labels.astype(np.float32)
images = images.astype(np.float32)
#Split into train and test set
train_idx, test_idx = train_test_split(np.arange(labels.shape[0]), test_size=0.1)
train_ds, train_labels, test_ds, test_labels = images[train_idx], labels[train_idx], images[test_idx], labels[test_idx]
You can do it by taking the mean over the channel axis of each image. For a single image of shape (H, W, 3), that is np.mean(array, axis=2).
I've downloaded your data and looked at it again. Since your first dimension is the batch size, the array has shape (N, H, W, 3), so you don't want to average over the 3rd axis. Your color channels are on the 4th axis (axis=3), so we'll average over that one and get the desired form.
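A minimal sketch (assuming images is the (N, H, W, 3) float32 array from the code above; keepdims and the luminance weights are illustrative choices, not from the original answer):

import numpy as np

# simple average over the channel axis (axis=3); keepdims=True keeps a
# trailing channel dimension of 1 so Keras Conv2D layers still get 4-D input
gray_images = np.mean(images, axis=3, keepdims=True)

# or a luminance-weighted average, closer to perceived brightness
weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
gray_images = np.tensordot(images, weights, axes=([3], [0]))[..., np.newaxis]

print(images.shape, '->', gray_images.shape)  # (N, H, W, 3) -> (N, H, W, 1)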
You don't have to make them grayscale in order to train your network, but if you specifically want to, you can check the following link:
How can I convert an RGB image into grayscale in Python?

Activation function with exponential distributed data

I am a beginner with neural networks. I have a bunch of targets to predict in a regression model. What I have noticed is that the model works perfectly with targets that are already normally distributed, but it does not work well with exponentially distributed targets. I understand this comes down to the choice of activation function, but I have tried many (relu, linear, selu, elu, etc.) and still didn't get a great result.
Please check the images below
Normally distributed
Exponentially distributed
That sort of makes sense, but riddle me this: are you taking the right approach? You don't need to assume normal distributions to do regression. It is a common misunderstanding that OLS somehow assumes normally distributed data; it does not, and it is far more general. OLS regression makes no assumptions about the data itself; it makes assumptions about the errors, as estimated by the residuals. Also, transforming data to make it fit a model is, in my opinion, the wrong approach. You want your model to fit your problem, not the other way round. That said, there are a few ways to deal with skewed data sets:
1. Normalize Data
2. Standardize Data
Let's see an example.
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
##iris.keys()
df = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                  columns=iris['feature_names'] + ['target'])
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
# Normalize the data attributes for the Iris dataset
# (iris is already loaded above)
from sklearn import preprocessing
import seaborn as sns
print(iris.data.shape)
# separate the data from the target attributes
X = df[['sepal length (cm)','sepal width (cm)','petal length (cm)','petal width (cm)']]
y = df['species']
sns.displot(X)
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
sns.displot(normalized_X)
# Standardize the data attributes for the Iris dataset
# (reusing the X and y separated above)
sns.displot(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
sns.displot(standardized_X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets
iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[iris.feature_names], iris.target, test_size=0.5, stratify=iris.target, random_state=123456)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=123456)
rf.fit(X_train, y_train)
from sklearn.metrics import accuracy_score
predicted = rf.predict(X_test)
accuracy = accuracy_score(y_test, predicted)
print(f'Out-of-bag score estimate: {rf.oob_score_:.3}')
print(f'Mean accuracy score: {accuracy:.3}')
Result:
Out-of-bag score estimate: 0.96
Mean accuracy score: 0.933
See the link below for specific info on these concepts.
https://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
Again, though, maybe this is not the right approach. For one thing, you can try applying a different model to your specific data set. Support Vector Machine algorithms only care about the boundaries of the separating hyperplane and do not assume any exact shape of the distributions. One of my favorites is from the Decision Tree family: specifically, the Random Forest model, as in the example above.
Also, see this link,
https://www.blopig.com/blog/2017/07/using-random-forests-in-python-with-scikit-learn/
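To sketch the SVM suggestion as well (reusing the train/test split from the Random Forest example above; the scaling step and RBF kernel are my own illustrative choices, not from the original answer):

from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale the features, then fit a kernel SVM on the same split as the forest
svm = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm.fit(X_train, y_train)
print(f'SVM mean accuracy: {svm.score(X_test, y_test):.3}')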

Getting TypeError while training a classifier for iris flower dataset

I am experimenting with using a linear output layer to classify the iris flower dataset as a regression problem, with target values 0, 1 and 2.
I am using one hidden tanh activation layer and then a linear layer. I deliberately tried this instead of one-hot encoding the labels, as I want to compare the scores from the 'model' function of my code (I am new to tensorflow). On running the code below...
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import tensorflow as tf
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
data=load_iris()
X=data['data']
Y=data['target']
pca=PCA(n_components=2)
X=pca.fit_transform(X)
#visualise the data
#plt.figure(figsize=(12,12))
#plt.scatter(X[:,0],X[:,1],c=Y,alpha=0.4)
#plt.show()
labels=Y.reshape(-1,1)
x_train,x_test,y_train,y_test=train_test_split(X,Y,test_size=0.3,random_state=42)
y_train=y_train.reshape(-1,1)
y_test=y_test.reshape(-1,1)
hidden_nodes=5
batch_size=100
num_features=2
lr=0.01
g = tf.Graph()
with g.as_default():
    tf_train_dataset = tf.placeholder(tf.float32, shape=[None, num_features])
    tf_train_labels = tf.placeholder(tf.float32, shape=[None, 1])
    tf_test_dataset = tf.constant(x_test, dtype=tf.float32)
    layer1_weights = tf.Variable(tf.truncated_normal([num_features, hidden_nodes]), dtype=tf.float32)
    layer1_biases = tf.Variable(tf.zeros([hidden_nodes]), dtype=tf.float32)
    layer2_weights = tf.Variable(tf.truncated_normal([hidden_nodes, 1]), dtype=tf.float32)
    layer2_biases = tf.Variable(tf.zeros([1]), dtype=tf.float32)

    def model(data):
        Z1 = tf.matmul(data, layer1_weights) + layer1_biases
        A1 = tf.nn.relu(Z1)
        Z2 = tf.matmul(A1, layer2_weights) + layer2_biases
        return Z2

    model_scores = model(tf_train_dataset)
    loss = tf.reduce_mean(tf.losses.mean_squared_error(model_scores, tf_train_labels))
    optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)
    #train_prediction=model_scores
    test_prediction = model(tf_test_dataset)

    num_steps = 10001
    with tf.Session() as sess:
        init = tf.global_variables_initializer()
        sess.run(init)
        for step in range(num_steps):
            offset = (step * batch_size) % (y_train.shape[0] - batch_size)
            minibatch_data = x_train[offset:(offset + batch_size), :]
            minibatch_labels = y_train[offset:(offset + batch_size)]
            feed_dict = {tf_train_dataset: minibatch_data, tf_train_labels: minibatch_labels}
            ll, loss, scores = sess.run([optimizer, loss, model_scores], feed_dict=feed_dict)
            if step % 1000 == 0:
                print('Minibatch loss at step {}:{}'.format(step, loss))
I get an error on the line
ll, loss, scores = sess.run([optimizer, loss, model_scores], feed_dict=feed_dict)
TypeError: Fetch argument 14.686994 has invalid type <class 'numpy.float32'>, must be a string or Tensor. (Can not convert a float32 into a Tensor or Operation.)
Why is this error coming up? Is it because of this line?
model_scores = model(tf_train_dataset)
How should I go about solving this issue, and can't the return value of the model function be a tensor, or be cast to a tensor?
Thanks.
That is because of this line:
ll, loss, scores = sess.run([optimizer, loss, model_scores], feed_dict=feed_dict)
On the first iteration it rebinds the name loss from the loss tensor to the float value returned by sess.run, so on the next iteration sess.run is asked to fetch a plain float instead of a tensor. Just use a different variable to store the loss value.
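For example, the changed lines inside the training loop would look like this (a minimal sketch of the fix; loss_val is just a new name of my choosing):

# fetch into a new name so `loss` keeps referring to the graph tensor
_, loss_val, scores = sess.run([optimizer, loss, model_scores], feed_dict=feed_dict)
if step % 1000 == 0:
    print('Minibatch loss at step {}:{}'.format(step, loss_val))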