I have a list of files, and I use the KNN algorithm to classify these files.
dataset = pd.read_csv(file)
training_samples = get_sample_number(dataset)
X_train = dataset.iloc[:training_samples, 5:9]
y_train = dataset.iloc[:training_samples, 9]
X_test = dataset.iloc[training_samples:, 5:9]
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # reuse the scaler fitted on the training data
# Fitting classifier to the training set
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Now I have my categories in the y_pred array, but I want to save the result in the file from which I read the dataset. How can I link each prediction to the right row in the file (or dataset)?

Your predictions in y_pred have a length of X_test.shape[0], which is smaller than the length of the original dataset. If you want to attach predictions to the original dataset that you read from the file, you need to predict on the whole dataset (scaled with the same fitted scaler) and then concatenate the result as a new column:
y_pred_all = classifier.predict(sc.transform(dataset.iloc[:, 5:9]))
dataset = pd.concat([dataset, pd.Series(y_pred_all, index=dataset.index, name='prediction')], axis=1)
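If only the test rows should receive a prediction, the row index of X_test can be used to align y_pred with the original frame. The following is a minimal sketch on synthetic data; the column names (f1..f4, label, prediction) and the split size are assumptions made for illustration, not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CSV file (hypothetical columns).
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.normal(size=(30, 4)), columns=["f1", "f2", "f3", "f4"])
dataset["label"] = (dataset["f1"] > 0).astype(int)
training_samples = 20  # assumed split point

X_train = dataset.iloc[:training_samples, 0:4]
y_train = dataset.iloc[:training_samples, 4]
X_test = dataset.iloc[training_samples:, 0:4]

# Fit the scaler on the training rows only, then reuse it for the test rows.
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train_scaled, y_train)
y_pred = classifier.predict(X_test_scaled)

# Align predictions with the original rows through the test index;
# rows that were used for training are left as NaN.
dataset["prediction"] = pd.Series(y_pred, index=X_test.index)

# dataset.to_csv(file, index=False) would then write the result back.
```

Because the assignment aligns on the index, the order of rows in the file is preserved and no manual bookkeeping of row numbers is needed.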


How to split the MNIST dataset into a smaller subset and add augmentation to it?

I am having trouble splitting the MNIST dataset and adding augmented data. I want to take only 22000 images in total (training + test set) from the MNIST dataset, which has 70000. MNIST has 10 labels. I am only using shear, rotation, width-shift, and height-shift as augmentation methods.
training set --> 20000(total) --> 20 images + 1980 augmentation images(per label)
test set --> 2000(total) --> 200 images(per label)
I also want to make sure that the class distribution is preserved in the split.
I am really confused about how to split the data. I would be grateful if anyone could provide the code.
I have tried this code:
# Load the MNIST dataset
(x_train_full, y_train_full), (x_test_full, y_test_full) = keras.datasets.mnist.load_data()
# Normalize the data
x_train_full = x_train_full / 255.0
x_test_full = x_test_full / 255.0
# Create a data generator for data augmentation
data_gen = ImageDataGenerator(shear_range=0.2, rotation_range=20,
width_shift_range=0.2, height_shift_range=0.2)
# Initialize empty lists for the training and test sets
x_train, y_train, x_test, y_test = [], [], [], []
# Loop through each class/label
for class_n in range(10):
    # Get the indices of the images for this class
    class_indices = np.where(y_train_full == class_n)[0]
    # Select 20 images for training
    train_indices = np.random.choice(class_indices, 20, replace=False)
    # Append the training images and labels to the respective lists
    # Select 200 images for test
    test_indices = np.random.choice(class_indices, 200, replace=False)
    # Append the test images and labels to the respective lists
    # Generate 100 augmented images for training
    x_augmented = data_gen.flow(x_train_full[train_indices], y_train_full[train_indices], batch_size=100)
    # Append the augmented images and labels to the respective lists
# Concatenate the list of images and labels to form the final training and test sets
x_train = np.concatenate(x_train)
y_train = np.concatenate(y_train)
x_test = np.concatenate(x_test)
y_test = np.concatenate(y_test)
print("training set shape: ", x_train.shape)
print("training label shape: ", y_train.shape)
print("test set shape: ", x_test.shape)
print("test label shape: ", y_test.shape)
but it keeps raising an error like this:
IndexError: index 15753 is out of bounds for axis 0 with size 10000
You are mixing the train and test set. In the loop, you are getting the class_indices from the train set:
# Get the indices of the images for this class
class_indices = np.where(y_train_full == class_n)[0]
but then you are using these train indices (that might be numbers above 10000!) to address indices in the testset (that has only 10000 samples) some lines further down:
# Select 200 images for test
test_indices = np.random.choice(class_indices, 200, replace=False)
So, inside the loop, compute a second set of class indices from the test labels (y_test_full) and draw the 200 test images from those; then it should work.
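The per-class selection described above can be sketched with NumPy alone. The helper name sample_per_class and the random label arrays are assumptions that stand in for the real MNIST labels; the point is only that train and test indices are drawn from their own label arrays.

```python
import numpy as np

def sample_per_class(labels, n_per_class, rng):
    """Return indices into `labels` with exactly n_per_class picks per class."""
    picked = []
    for class_n in np.unique(labels):
        class_indices = np.where(labels == class_n)[0]
        picked.append(rng.choice(class_indices, n_per_class, replace=False))
    return np.concatenate(picked)

rng = np.random.default_rng(0)
# Random labels with the same sizes as the MNIST train and test splits.
y_train_full = rng.integers(0, 10, size=60000)
y_test_full = rng.integers(0, 10, size=10000)

# Train indices come from the train labels, test indices from the test labels.
train_idx = sample_per_class(y_train_full, 20, rng)
test_idx = sample_per_class(y_test_full, 200, rng)

print(train_idx.shape, test_idx.shape)
```

With 20 original images per class, each image would need 99 augmented copies to reach the 1980 augmented images per class stated in the question (20 x 99 = 1980).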

Generate and assemble model predictions for each stratified kfold test split

I would like to generate multiple test data splits using stratified KFold (skf) and then generate and assemble predictions for each of these test splits (and hence for all of the data) using a sklearn model. I am at my wits' end on how to do this programmatically.
I have reproduced my code using a minimal data example below. Briefly (after data import), I have a function that fits the model and generates predicted probabilities. I then attempt to apply this function to each skf split of my data so as to generate and collate predicted probabilities for each row. However, this step fails with a ValueError (boolean array expected). My code follows below:
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
#load data, assemble dataframe
iris = datasets.load_iris()
X = pd.DataFrame(iris.data[51:150, :], columns = ["sepal_length", "sepal_width",
                                                   "petal_length", "petal_width"])
y = pd.DataFrame(iris.target[51:150,], columns = ["target"])
df = pd.concat([X,y], axis = 1)
#instantiate logistic regression
log = LogisticRegression()
#modelling function
def train_model(train, test, fold):
    X = df.drop("target", axis = 1)
    y = df["target"]
    X_train = train[X]
    y_train = train[y]
    X_test = test[X]
    y_test = test[y]
    #generate probability of class 1 predictions from logistic regression model fit
    prob = log.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    return (prob)
#generate stratified k-fold splits (2 used as example here)
skf = StratifiedKFold(n_splits = 2)
#generate and collate all predictions (for each row in df)
fold = 1
outputs = []
for train_index, test_index in skf.split(df, y):
    train_df = df.loc[train_index,:]
    test_df = df.loc[test_index,:]
    output = train_model(train_df,test_df,fold) #generate model probabilities for X_test in skf split
    outputs.append(output) #append all model probabilities
    fold = fold + 1
all_preds = pd.concat(outputs)
Can somebody please guide me to the solution that includes row index and its predicted probability?
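The ValueError most likely comes from train[X] and train[y]: indexing a DataFrame with another DataFrame is treated as a boolean mask. One possible way to get the row index together with its predicted probability is to select columns by name, index rows by position, and wrap each fold's probabilities in a DataFrame that carries the test rows' index. This is a sketch under those assumptions; the max_iter value and the fold/prob column names are chosen for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
feature_cols = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
# Same row slice as in the question: only two classes remain.
X = pd.DataFrame(iris.data[51:150, :], columns=feature_cols)
y = pd.Series(iris.target[51:150], name="target")

skf = StratifiedKFold(n_splits=2)
outputs = []
for fold, (train_pos, test_pos) in enumerate(skf.split(X, y), start=1):
    # skf.split yields positional indices, so use iloc rather than loc.
    model = LogisticRegression(max_iter=1000)
    model.fit(X.iloc[train_pos], y.iloc[train_pos])
    # Column 1 of predict_proba corresponds to model.classes_[1].
    prob = model.predict_proba(X.iloc[test_pos])[:, 1]
    # Keep the original row index so each probability maps back to its row.
    outputs.append(pd.DataFrame({"fold": fold, "prob": prob}, index=X.index[test_pos]))

# One row per original row, ordered by the original index.
all_preds = pd.concat(outputs).sort_index()
print(all_preds.head())
```

If a single out-of-fold prediction per row is all that is needed, sklearn.model_selection.cross_val_predict with method="predict_proba" is an alternative worth considering.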

XGBoost iterative training: Not having all 0,...,C labels in minibatch without erroring

When training XGBoost iteratively on data too large to fit in memory, one may want to use "batches". The problem, however, is that a batch may not contain all of the labels 0,...,C. This leads to the error ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class-1]
Is there a way to train XGBoost where we just have some subset of the labels, which may not contain zero?
The code has structure similar to this:
train = module.trainloader
test = module.valloader
# Train on one minibatch to get started
sample = next(iter(train))
X = xgb.DMatrix(sample[0].numpy(), label=sample[1].numpy())
params = {
    'learning_rate': 0.007,
    'process_type': 'update',
}
# Get initial model training
model = xgb.train(params, dtrain=X)
for i, (trainsample, valsample) in enumerate(zip(train, test)):
    X_train, y_train = trainsample
    X_test, y_test = valsample
    X_train = xgb.DMatrix(X_train, label=y_train)  # the keyword is `label`, not `labels`
    X_test = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain=X_train, xgb_model=model)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
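One direction that may be worth trying is to fix the label encoding and the class count once, globally, instead of letting them be inferred from each batch. The sketch below only shows the encoding step in plain NumPy; ALL_CLASSES and its values are hypothetical, and whether an explicit num_class avoids the error depends on the XGBoost version and API in use, so treat that part as an assumption to verify.

```python
import numpy as np

# Hypothetical full set of raw label values, known before training starts.
ALL_CLASSES = [3, 7, 9, 12]
NUM_CLASS = len(ALL_CLASSES)
CLASS_TO_INDEX = {c: i for i, c in enumerate(ALL_CLASSES)}

def encode_labels(raw_labels):
    """Map raw labels to fixed positions 0..NUM_CLASS-1, whatever the batch contains."""
    return np.array([CLASS_TO_INDEX[c] for c in raw_labels], dtype=np.int64)

# A batch that contains only a subset of the classes and no class 0.
batch_labels = np.array([7, 12, 7, 9])
encoded = encode_labels(batch_labels)
print(encoded)

# Assumed parameters: the class count is stated explicitly rather than inferred.
params = {"objective": "multi:softprob", "num_class": NUM_CLASS}
```

The same mapping is applied to every batch, so a class keeps the same integer index even when it is absent from a given batch.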

How to get the labels from tensorflow dataset

ds_test = tf.data.experimental.make_csv_dataset(
    file_pattern="./dfj_test/part-*.csv.gz",
    batch_size=batch_size, num_epochs=1,
    # select_columns=select_cols,
    num_parallel_reads=30, compression_type='GZIP',
)
This is my test set during training. After training the model, I want to zip the columns of predictions and labels for ds_test.
preds = model.predict(ds_test)
Getting the predictions is quite simple, and they come back as a numpy array. However, I don't know how to get the corresponding labels from ds_test.
I want to zip(preds, labels) for further analysis.
Any hint? Thanks.
(tf version 2.3.1)
You can map each example to return the field you want:
# load some exemplary data
train_file_path = tf.keras.utils.get_file("train.csv", TRAIN_DATA_URL)
dataset = tf.data.experimental.make_csv_dataset(train_file_path, batch_size=100, num_epochs=1)
# get field by unbatching
labels_iterator= dataset.unbatch().map(lambda x: x['survived']).as_numpy_iterator()
labels = np.array(list(labels_iterator))
# get field by concatenating batches
labels_iterator = dataset.map(lambda x: x['survived']).as_numpy_iterator()
labels = np.concatenate(list(labels_iterator))
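One caveat when pairing predictions with labels: make_csv_dataset shuffles by default (shuffle=True), so two separate passes over the dataset may visit rows in a different order. A way to keep them aligned is to collect both in the same pass (or to build the dataset with shuffle=False). The sketch below uses a small in-memory dataset; the feature names and fake_model are hypothetical stand-ins for the real columns and for model.predict.

```python
import numpy as np
import tensorflow as tf

# Small in-memory dataset with dict-style examples, like make_csv_dataset yields.
features = {
    "age": np.arange(10, dtype=np.float32),
    "survived": (np.arange(10) % 2).astype(np.int64),
}
ds = tf.data.Dataset.from_tensor_slices(features).batch(4)

def fake_model(batch):
    """Hypothetical stand-in for model.predict on one batch."""
    return batch["age"].numpy() * 0.1

preds, labels = [], []
for batch in ds:
    # Predictions and labels come from the same batch, so they stay aligned.
    preds.append(fake_model(batch))
    labels.append(batch["survived"].numpy())

preds = np.concatenate(preds)
labels = np.concatenate(labels)
pairs = list(zip(preds, labels))
```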

Preprocessing for TensorFlow Dataset 'cats_vs_dogs'

I am trying to create a preprocessing function so that the training_dataset can be directly fed into a keras sequential neural network. The preprocess function should return features and labels.
def preprocessing_function(data):
    features = ...
    labels = ...
    return features, labels
dataset, info = tfds.load(name='cats_vs_dogs', split=tfds.Split.TRAIN, with_info=True)
training_dataset = dataset.map(preprocessing_function)
How should I write the preprocessing_function? I spent several hours researching and trying to make it happen, but to no avail. Hoping someone can assist.
Here are two functions for preprocessing. The first one is applied to both the training and validation data to resize the images to the size the network expects and cast them to float. The second function, augmentation, is applied to the training set only. The type of augmentation you want depends on your dataset and application; this one is provided as an example.
#Fetching, pre-processing & preparing data-pipeline
def preprocess(ds):
    x = tf.image.resize_with_pad(ds['image'], IMG_SIZE_W, IMG_SIZE_H)
    x = tf.cast(x, tf.float32)
    y = tf.one_hot(ds['label'], NUM_CLASSES)
    return x, y
def augmentation(image,label):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.resize_with_crop_or_pad(image, IMG_W+4, IMG_W+4) # zero pad each side with 4 pixels
    image = tf.image.random_crop(image, size=[BATCH_SIZE, IMG_W, IMG_H, 3]) # Random crop back to 32x32
    return image, label
and to load training and validation datasets, do something like this:
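def get_dataset(dataset_name, shuffle_buff_size=1024, batch_size=BATCH_SIZE, augmented=True):
    train, info_train = tfds.load(dataset_name, split='train[:80%]', with_info=True)
    val, info_val = tfds.load(dataset_name, split='train[80%:]', with_info=True)
    TRAIN_SIZE = info_train.splits['train'].num_examples * 0.8
    VAL_SIZE = info_train.splits['train'].num_examples * 0.2
    # Assumed pipeline (map, shuffle, batch); the exact chain of calls is not shown here.
    # drop_remainder=True because the random crop in augmentation expects full batches.
    train = train.map(preprocess).shuffle(shuffle_buff_size).batch(batch_size, drop_remainder=True)
    if augmented:
        train = train.map(augmentation)
    train = train.prefetch(tf.data.experimental.AUTOTUNE)
    val = val.map(preprocess).batch(batch_size)
    val = val.prefetch(tf.data.experimental.AUTOTUNE)
    return train, info_train, val, info_val, TRAIN_SIZE, VAL_SIZE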
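To check a preprocessing function of this shape without downloading cats_vs_dogs, it can be mapped over a small synthetic dataset that uses the same 'image' and 'label' keys. This is a sketch only: the image size of 64, the division by 255, and the random input images are assumptions for illustration.

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = 64      # assumed target size
NUM_CLASSES = 2    # cats and dogs

def preprocess(example):
    """Resize with padding, scale pixels to [0, 1], and one-hot encode the label."""
    image = tf.image.resize_with_pad(example["image"], IMG_SIZE, IMG_SIZE)
    image = tf.cast(image, tf.float32) / 255.0
    label = tf.one_hot(example["label"], NUM_CLASSES)
    return image, label

# Synthetic examples shaped like tfds image records (height x width x 3, uint8).
fake = {
    "image": np.random.randint(0, 256, size=(6, 80, 100, 3), dtype=np.uint8),
    "label": np.array([0, 1, 0, 1, 1, 0], dtype=np.int64),
}
ds = tf.data.Dataset.from_tensor_slices(fake).map(preprocess).batch(4)

images, labels = next(iter(ds))
print(images.shape, labels.shape)
```

A dataset of (image, label) batches like this can be passed directly to model.fit; with one-hot labels the matching loss would be a categorical cross-entropy.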