How to select almost equally distributed classes in training, validation, and test samples? - pandas

I am working on the MNIST Sign Language dataset to classify images using Keras. There are 24 different classes in the dataset, but the problem is that the distribution of classes is very uneven.
I used sklearn.model_selection.train_test_split with stratify=df['label'], but some classes still end up with 5% of the data while others have 3%. How can I draw a sample in which each class makes up roughly 4% of the whole?
My test_df has 7172 rows and 785 columns, one of which is the label column; the remaining 784 are grayscale pixel values (28*28).
test_df = pd.read_csv(TEST_PATH)
# shuffle and split validation,test data
test_df = test_df.sample(frac=1.0,random_state=SEED).iloc[:2000,:] # shuffle the whole data, get first 2000 rows
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])
# stratify the labels so that the distribution of classes is almost the same
# extract pixels and labels for both validation,test data
X_val = val_df.drop('label',axis=1).values.reshape((val_df.shape[0],28,28))/255.0 # validation images
y_val = val_df['label'].ravel() # validation label
X_test = test_df.drop('label',axis=1).values.reshape((test_df.shape[0],28,28))/255.0 # test images
y_test = test_df['label'].ravel() # test labels

The snippet below gives you a uniform distribution across the validation and test sets. You can also play with the number of samples:
SEED = 42
n_classes = 24
test_df = pd.read_csv(TEST_PATH)
# draw the same number of rows from every class (2000/n_classes each)
test_df = [test_df.loc[test_df.label == i].sample(n=int(2000/n_classes), random_state=SEED) for i in test_df.label.unique()]
test_df = pd.concat(test_df, axis=0, ignore_index=True)
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])
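One caveat (my note, not part of the answer): .sample(n=...) raises a ValueError if a class has fewer rows than requested. A groupby-based variant of the same idea caps each class at however many rows it actually has:
# Sketch: per-class sampling that tolerates under-represented classes.
test_df = pd.read_csv(TEST_PATH)
n = int(2000 / n_classes)
balanced_df = (test_df.groupby('label', group_keys=False)
                      .apply(lambda g: g.sample(n=min(len(g), n), random_state=SEED)))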

Related

What to do if the padding is of different length on test data in NLP?

I am having some confusion right now regarding maxlen in pad_sequences(). Currently I am doing Fake News Classification. My question is: while processing the test data, what if its maxlen is longer than the training data's? Let me give you my situation.
My code snippet:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(text)
text_tokenizer = tokenizer.texts_to_sequences(text)
len(text_tokenizer) # Getting 18285
max([len(x) for x in text_tokenizer])
I am getting maxlen = 51.
text_tokenizer_pad = pad_sequences(text_tokenizer, maxlen=max([len(x) for x in text_tokenizer]))
I am doing padding based on this maxlen size which is 51.
If I see the shape of text_tokenizer_pad, I get (18285, 51)
text_tokenizer_pad.shape
This is what I did in training part.
Let me show you what I did for testing part.
text_test_tokenizer = tokenizer.texts_to_sequences(text_test)
As you can see, I didn't call fit_on_texts() on the test data.
len(text_test_tokenizer) # Gives 4575
max([len(x) for x in text_test_tokenizer])
This gives maxlen as 43, which is less than 51. Fine.
Now I am padding the test data to the same length as the training data, i.e. 51, because that is the length I used in train_test_split and during training, so the model expects 51 columns.
text_test_tokenizer_pad = pad_sequences(text_test_tokenizer, maxlen=max([len(x) for x in text_tokenizer]))
After that I am predicting. I have 2 questions regarding all of this:
Am I doing this correctly, I mean the tokenizing and padding on the train and test data?
What if the maxlen for the test data is longer than for the training data? What value should I choose then, given that text_tokenizer_pad is the one I used in train_test_split:
x_train, x_test, y_train, y_test = train_test_split(text_tokenizer_pad, labels, test_size = 0.2, shuffle=True, random_state=42)
I tried a constant value, say maxlen of 100 for both train and test, but this is not a perfect solution; there can always be a test sample longer than 100.
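For reference (this is standard Keras behaviour, not something stated in the post): pad_sequences not only pads shorter sequences but also truncates any sequence longer than maxlen, so you can always reuse the training-time length for the test data. A minimal sketch:
from tensorflow.keras.preprocessing.sequence import pad_sequences

TRAIN_MAXLEN = 51  # the length derived from the training data above
# Shorter sequences are zero-padded, longer ones truncated (from the front
# by default; pass truncating='post' to cut from the end instead).
text_test_tokenizer_pad = pad_sequences(text_test_tokenizer, maxlen=TRAIN_MAXLEN)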

Keras data generator predict same number of values

I have implemented a CNN-based regression model that uses a data generator to handle the huge amount of data I have. Training and evaluation work well, but there's an issue with prediction. If, for example, I want to predict values for a test dataset of 50 samples, I use model.predict with a batch size of 5. The problem is that model.predict returns 5 values repeated 10 times instead of 50 different values. The same thing happens if I change the batch size to 1: it returns one value 50 times.
To work around this, I used the full batch size (50 in my example), and it worked. But I can't use this method on my whole test dataset because it's too large.
Do you have any other solution, or what is the problem in my approach?
My data generator code:
import numpy as np
import keras

class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, data_X, data_Z, target_y, batch_size=32,
                 dim1=(120,120), dim2=80, n_channels=1, shuffle=True):
        'Initialization'
        self.dim1 = dim1
        self.dim2 = dim2
        self.batch_size = batch_size
        self.data_X = data_X
        self.data_Z = data_Z
        self.target_y = target_y
        self.list_IDs = list_IDs
        self.n_channels = n_channels
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(len(self.list_IDs) / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
        # Generate data
        ([X, Z], y) = self.__data_generation(list_IDs_temp)
        return ([X, Z], y)

    def on_epoch_end(self):
        'Updates indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        if self.shuffle == True:
            np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        # Initialization
        X = np.empty((self.batch_size, *self.dim1, self.n_channels))
        Z = np.empty((self.batch_size, self.dim2))
        y = np.empty((self.batch_size))
        # Generate data
        for i, ID in enumerate(list_IDs_temp):
            # Store sample
            X[i,] = np.load('data/' + self.data_X + ID + '.npy')
            Z[i,] = np.load('data/' + self.data_Z + ID + '.npy')
            # Store target
            y[i] = np.load('data/' + self.target_y + ID + '.npy')
        return ([X, Z], y)
How I call model.predict():
predict_params = {'list_IDs': 'indexes',
                  'data_X': 'images',
                  'data_Z': 'factors',
                  'target_y': 'True_values',
                  'batch_size': 5,
                  'dim1': (120,120),
                  'dim2': 80,
                  'n_channels': 1,
                  'shuffle': False}
# Prediction generator
prediction_generator = DataGenerator(test_index, **predict_params)
prediction_results = model.predict(prediction_generator, steps=1, verbose=1)
If we look at your __getitem__ function, we can see this code:
list_IDs_temp = [self.list_IDs[k] for k in range(len(indexes))]
This code will always return the same IDs, because len(indexes) is always the same (at least as long as all batches have an equal number of samples) and we just loop over the first few indexes every time.
You are already extracting the indexes of the current batch beforehand, so the erroneous line is not needed at all. The following code should work:
def __getitem__(self, index):
    'Generate one batch of data'
    # Generate indexes of the batch
    list_IDs_temp = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
    # Generate data
    ([X, Z], y) = self.__data_generation(list_IDs_temp)
    return ([X, Z], y)
See if this code works and you get different results. Note that you should now expect bad predictions, because during training your model will also only have seen those same few data points, due to the same bug.
When you use a generator you specify a batch size, and model.predict will produce batch_size predictions per step. If you set steps=1, that is all the predictions you will get. To set the steps, take the number of samples you have and divide it by the batch size: for example, if you have 50 images and a batch size of 5, you should set steps equal to 10. Ideally you want to go through your test set exactly once. The code below determines the batch size and steps that do that. In the code, b_max is a value you select that limits the maximum batch size; set it based on your memory size to avoid an OOM (out of memory) error. The parameter length is the number of test samples you have.
length=500
b_max=80
batch_size = sorted([int(length/n) for n in range(1, length+1) if length % n == 0 and length/n <= b_max], reverse=True)[0]
steps=int(length/batch_size)
The result will be batch_size=50 and steps=10. Note that if length is a prime number, the result will be batch_size=1 and steps=length.
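For convenience, the same divisor search can be wrapped in a small helper (a sketch of the logic above, not part of the original answer):
def batch_and_steps(length, b_max):
    # Largest divisor of length not exceeding b_max, plus the matching
    # step count, so predict walks the test set exactly once.
    batch_size = max(n for n in range(1, b_max + 1) if length % n == 0)
    return batch_size, length // batch_size

print(batch_and_steps(500, 80))  # (50, 10)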
According to this solution, you need to change your steps to the total number of images you want to test on. Try:
# Assuming test_index is a list
prediction_results = model.predict(prediction_generator, steps=len(test_index), verbose=1)

Applying Keras Model to give an output value for each row

I am learning Keras and would like to understand how I can apply a classifier (Sequential) to all rows in my dataset, not just the x% held out for test validation.
The confusion I am having is that when I define my data split, I have one portion for train and one for test. How would I apply the model to the full dataset so it shows me the predicted value for each row? The end goal is to produce and concatenate the predicted value for every customer in the dataset.
dataset = pd.read_csv('BankCustomers.csv')
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
feature_train, feature_test, label_train, label_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
sc = StandardScaler()
feature_train = sc.fit_transform(feature_train)
feature_test = sc.transform(feature_test)
For completeness the classifier looks like below.
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(activation="relu", input_dim=11, units=6, kernel_initializer="uniform"))
# Adding the second hidden layer
classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))
# Adding the output layer
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(feature_train, label_train, batch_size = 10, epochs = 100)
The course I am doing suggests ways to get accuracy and predictions for the test set, as below, but not for the full batch.
# Predicting the Test set results
label_pred = classifier.predict(feature_test)
label_pred = (label_pred > 0.5) # FALSE/TRUE depending on above or below 50%
cm = confusion_matrix(label_test, label_pred)
accuracy=accuracy_score(label_test,label_pred)
I tried concatenating the model's output on both the training and test data, but I was then unsure how to work out which index matched up with the original dataset (i.e. I don't know which rows of the 20% test data correspond to which rows of the original set).
I apologise in advance if this question is superfluous; I have been looking for answers on Stack Overflow and via the course, but so far no luck.
You can utilize pandas indexes to sort your results back to the original order.
Predict on each feature_train and feature_test (not sure why you'd want to predict on feature_train though.)
Add a new column to each feature_train and feature_test, which would contain the predictions
feature_train["predictions"] = pd.Series(classifier.predict(feature_train))
feature_test["predictions"] = pd.Series(classifier.predict(feature_test))
If you look at the indexes of each data frame above, you can see they're shuffled (because of the train_test_split).
You can now concatenate them, use sort_index, and retrieve the predictions column, which will have the predictions in the order they appeared in your initial dataframe (X):
pd.concat([feature_train, feature_test], axis=0).sort_index()
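If the goal is simply a prediction for every customer, a more direct route (a minimal sketch, not from the answer above, reusing X, sc, dataset and classifier from the question) is to scale the full feature set with the already-fitted scaler and attach the output to the original dataframe; the column name 'predicted' is illustrative:
# Scale ALL rows with the scaler fitted on the training split, then predict.
all_features = sc.transform(X)  # X is the full, unscaled feature frame
dataset['predicted'] = (classifier.predict(all_features) > 0.5).astype(int).ravel()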

Look up BernoulliNB Probability in Dataframe

I have some training data (TRAIN) and some test data (TEST).
Each row of each dataframe contains an observed class (X) and some columns of binary features (Y). BernoulliNB predicts the probability of X given Y in the test data, based on the training data. I am trying to look up the probability of the observed class of each row in the test data (Pr).
Edit: I used Antoine Zambelli's advice to fix the code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB

BNB = BernoulliNB()

# Training Data
TRAIN = pd.DataFrame({'X' : [1,2,3,9],
                      'Y1': [1,1,0,0],
                      'Y4': [1,0,0,0]})

# Test Data
TEST = pd.DataFrame({'X' : [5,0,1,1,1,2,2,2,2],
                     'Y1': [1,1,0,1,0,1,0,0,0],
                     'Y2': [1,0,1,0,1,0,1,0,1],
                     'Y3': [1,1,0,1,1,0,0,0,0],
                     'Y4': [1,1,0,1,1,0,0,0,0]})
# Add the information that TRAIN has none of the missing items
diff_cols = set(TEST.columns) - set(TRAIN.columns)
for i in diff_cols:
    TRAIN[i] = 0

# Split the data
Se_Tr_X = TRAIN['X']
Se_Te_X = TEST['X']
df_Tr_Y = TRAIN.drop('X', axis=1)
df_Te_Y = TEST.drop('X', axis=1)[df_Tr_Y.columns]  # align column order with the training frame

# Train: Bernoulli Naive Bayes Classifier
A_F = BNB.fit(df_Tr_Y, Se_Tr_X)

# Test: Predict Probability
Ar_R = BNB.predict_proba(df_Te_Y)
df_R = pd.DataFrame(Ar_R)

# Rename the columns after the classes of X
df_R.columns = BNB.classes_
df_S = df_R.join(TEST)

# Look up the predicted probability of the observed X
# Skip X's that are not in the training data
def get_lu(df):
    def lu(i, j):
        # Column j is a class label, row i an observation index;
        # classes absent from training fall through to NaN.
        return df.get(j, {}).get(i, np.nan)
    return lu

df_S['Pr'] = [*map(get_lu(df_R), df_S.T, df_S.X)]
This seemed to work, giving me the result (df_S).
This correctly gives a "NaN" for the first 2 rows because the training data contains no information about classes X=5 or X=0.
Ok, there are a couple of issues here. I have a full working example below, but first those issues, mainly the assertion that "this correctly gives a NaN for the first 2 rows".
This ties back to the way classification algorithms are used and what they can do. The training data contains all the information you want your algorithm to know and be able to act on. The test data is only going to be processed with that information in mind. Even if you (the person) know that the test label is 5 and not included in the training data, the algorithm doesn't know that. It is only going to look at the feature data and then try to predict the label from those. So it can't return nan (or 5, or anything not in the training set) - that nan is coming from your work going from df_R to df_S.
This leads to the second issue, which is the line df_Te_Y = TEST.iloc[:, 1:]; that line should be df_Te_Y = TEST.iloc[:, 2:], so that it does not include the label data. Label data only appears in the training set. The predicted labels will only ever be drawn from the set of labels that appear in the training data.
Note: I've changed the class labels to be Y and the feature data to be X because that's standard in the literature.
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
import pandas as pd
BNB = BernoulliNB()
# Training Data
train_df = pd.DataFrame({'Y' : [1,2,3,9], 'X1': [1,1,0,0], 'X2': [0,0,0,0], 'X3': [0,0,0,0], 'X4': [1,0,0,0]})
# Test Data
test_df = pd.DataFrame({'Y' : [5,0,1,1,1,2,2,2,2],
                        'X1': [1,1,0,1,0,1,0,0,0],
                        'X2': [1,0,1,0,1,0,1,0,1],
                        'X3': [1,1,0,1,1,0,0,0,0],
                        'X4': [1,1,0,1,1,0,0,0,0]})
X = train_df.drop('Y', axis=1) # Known training data - all but 'Y' column.
Y = train_df['Y'] # Known training labels - just the 'Y' column.
X_te = test_df.drop('Y', axis=1) # Test data.
Y_te = test_df['Y'] # Only used to measure accuracy of prediction - if desired.
Ar_R = BNB.fit(X, Y).predict_proba(X_te) # Can be combined to a single line.
df_R = pd.DataFrame(Ar_R)
df_R.columns = BNB.classes_ # Rename as per class labels.
# Columns are class labels and Rows are observations.
# Each entry is a probability of that observation being assigned to that class label.
print(df_R)
predicted_labels = df_R.idxmax(axis=1).values # For each row, take the column with the highest prob in that row.
print(predicted_labels) # [1 1 3 1 3 2 3 3 3]
print(accuracy_score(Y_te, predicted_labels)) # Percent accuracy of prediction.
print(BNB.fit(X, Y).predict(X_te)) # [1 1 3 1 3 2 3 3 3], can be used as one line if predicted_labels is all we want.
# NOTE: change train_df to have 'Y': [1,2,1,9] and we get predicted_labels = [1 1 9 1 1 1 9 1 9].
# So probabilities have changed.
I recommend reviewing some tutorials or other material on classification algorithms if this doesn't make sense after reading the code.

How do I split a dataset into training, test, and cross validation sets?

First I normalized numerical data from a 1000x20 array and then created another array containing a random permutation of the row indices of the normalized data. How do I use this to split the data into a training, cross-validation, and test set?
In[150]:
row_indices = np.random.permutation(X_norm.shape[0])
In[151]:
# Create a Training Set - 60 percent of data - 600x20
X_train =
# Create a Cross Validation Set - 20 percent - 200x20
X_crossVal =
# Create a Test Set - 20 percent - 200x20
X_test =
# If you performed the above calculations correctly, then X_train
# should have 600 rows and 20 columns, X_crossVal should have 200 rows
# and 20 columns, and X_test should have 200 rows and 20 columns. You
# can verify this by filling the code below:
In[152]:
# Print the shape of X_train
X_train.shape
# Print the shape of X_crossVal
# Print the shape of X_test
Please forgive how bad I am at Stack Overflow.
You can use np.split to cut the shuffled data (here X_norm indexed by the permuted row_indices) into chunks of predefined sizes:
X_train, X_crossVal, X_test = np.split(X_norm[row_indices], [600, 800])
API Documentation
# Create a Training Set
X_train = X_norm[row_indices[0:600]]
# Create a Cross Validation Set
X_crossVal = X_norm[row_indices[600:800]]
# Create a Test Set
X_test = X_norm[row_indices[800:1000]]
Also make sure when you print them that you use:
print(X_train.shape)
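For completeness, a minimal sklearn-based alternative (a sketch assuming X_norm from the question; train_test_split is simply called twice to get a 60/20/20 split):
from sklearn.model_selection import train_test_split

# First peel off 60% for training, then halve the remainder into 20%/20%.
X_train, X_rest = train_test_split(X_norm, train_size=0.6, random_state=42)
X_crossVal, X_test = train_test_split(X_rest, test_size=0.5, random_state=42)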