How to split the MNIST dataset into a smaller subset and add augmentation to it? - tensorflow

I have a problem splitting the MNIST dataset and adding augmented data. I want to take only 22,000 images in total (training + test set) from the MNIST dataset, which has 70,000. MNIST has 10 labels. I'm only using shear, rotation, width-shift, and height-shift as augmentation methods.
training set --> 20,000 total --> 20 original images + 1,980 augmented images (per label)
test set --> 2,000 total --> 200 images (per label)
I also want to make sure that the class distribution is preserved in the split.
I'm really confused about how to split the data and would be glad if anyone could provide code.
I have tried this code:
# Load the MNIST dataset
(x_train_full, y_train_full), (x_test_full, y_test_full) = keras.datasets.mnist.load_data()
# Normalize the data
x_train_full = x_train_full / 255.0
x_test_full = x_test_full / 255.0
# Create a data generator for data augmentation
data_gen = ImageDataGenerator(shear_range=0.2, rotation_range=20,
                              width_shift_range=0.2, height_shift_range=0.2)
# Initialize empty lists for the training and test sets
x_train, y_train, x_test, y_test = [], [], [], []
# Loop through each class/label
for class_n in range(10):
    # Get the indices of the images for this class
    class_indices = np.where(y_train_full == class_n)[0]
    # Select 20 images for training
    train_indices = np.random.choice(class_indices, 20, replace=False)
    # Append the training images and labels to the respective lists
    x_train.append(x_train_full[train_indices])
    y_train.append(y_train_full[train_indices])
    # Select 200 images for test
    test_indices = np.random.choice(class_indices, 200, replace=False)
    # Append the test images and labels to the respective lists
    x_test.append(x_test_full[test_indices])
    y_test.append(y_test_full[test_indices])
    # Generate 100 augmented images for training
    x_augmented = data_gen.flow(x_train_full[train_indices], y_train_full[train_indices], batch_size=100)
    # Append the augmented images and labels to the respective lists
    x_train.append(x_augmented[0])
    y_train.append(x_augmented[1])
# Concatenate the list of images and labels to form the final training and test sets
x_train = np.concatenate(x_train)
y_train = np.concatenate(y_train)
x_test = np.concatenate(x_test)
y_test = np.concatenate(y_test)
print("training set shape: ", x_train.shape)
print("training label shape: ", y_train.shape)
print("test set shape: ", x_test.shape)
print("test label shape: ", y_test.shape)
but it keeps giving an error like this:
IndexError: index 15753 is out of bounds for axis 0 with size 10000

You are mixing the train and test set. In the loop, you are getting the class_indices from the train set:
# Get the indices of the images for this class
class_indices = np.where(y_train_full == class_n)[0]
but then you are using these train indices (which might be numbers above 10000!) to index into the test set (which has only 10000 samples) a few lines further down:
# Select 200 images for test
test_indices = np.random.choice(class_indices, 200, replace=False)
So you will need to do the same per-class index selection on the test labels (y_test_full) inside the loop, and it should work out.
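A minimal sketch of that fix (assuming the question's variable names), selecting the test indices from the test labels so they stay within the 10,000 test samples:
for class_n in range(10):
    # ... training selection as before ...
    # Select 200 test images using indices computed from the TEST labels,
    # so np.random.choice only returns valid positions in x_test_full.
    test_class_indices = np.where(y_test_full == class_n)[0]
    test_indices = np.random.choice(test_class_indices, 200, replace=False)
    x_test.append(x_test_full[test_indices])
    y_test.append(y_test_full[test_indices])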

Related

XGBoost iterative training: Not having all 0,...,C labels in minibatch without erroring

When training XGBoost iteratively on data too large to fit in memory, one may want to use "batches". The problem, however, is that each batch may not contain all of the labels 0,...,C. This leads to the error ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class-1].
Is there a way to train XGBoost when we only have some subset of the labels, which may not contain zero?
The code has a structure similar to this:
train = module.trainloader
test = module.valloader
# Train on one minibatch to get started
sample = next(iter(train))
X = xgb.DMatrix(sample[0].numpy(), label=sample[1].numpy())
params = {
    'learning_rate': 0.007,
    'updater': 'refresh',
    'process_type': 'update',
}
# Get initial model training
model = xgb.train(params, dtrain=X)
for i, (trainsample, valsample) in enumerate(zip(train, test)):
    X_train, y_train = trainsample
    X_test, y_test = valsample
    X_train = xgb.DMatrix(X_train, label=y_train)
    X_test = xgb.DMatrix(X_test)
    model = xgb.train(params, dtrain=X_train, xgb_model=model)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)
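One commonly suggested direction (a hedged sketch, not a verified fix for this exact setup) is to declare the multi-class objective and the total class count up front, so XGBoost does not infer the label set from whichever labels happen to appear in a given minibatch:
C = 10  # assumption: total number of classes across the whole dataset
params = {
    'learning_rate': 0.007,
    'updater': 'refresh',
    'process_type': 'update',
    'objective': 'multi:softmax',  # declare the multi-class objective explicitly
    'num_class': C,                # declare all C classes even if a batch lacks some
}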

CNN sudden drop in accuracy after steady increase over ~24 epochs

I'm attempting to create a CNN to classify emotions based on facial expressions in an image. I'm using this dataset I found on Kaggle, in CSV format, with each row holding grayscale image data and an emotion. It has ~30,000 data points randomly split into 80% training, 10% validation and 10% testing.
While adjusting the settings and structure of my CNN I've experienced at least one of these problems at a time:
CNN only (or mainly) predicting one or two outputs.
Validation accuracy that does not change at all.
Accuracy fluctuating around the equivalent of random guessing (16.66%) +/- 10%.
Constant accuracy and loss for both training and validation.
I've tried changing the learning rate, optimizer, batch size, number of filters and epochs. I've used a different data set and tried class weights for the slight imbalance, as well as varying numbers of hidden layers and complexities.
I've also tried training on one sample, but got this for accuracy and this for loss.
The following graphs are for the configuration currently in the code; changing what I've listed above just results in one of the problems I mentioned.
Train/Val Accuracy during training
Train/Val Loss during training
I suspected it may be due to labels being mismatched with the data, so I manually checked whether labels and images matched in train_set, which they did.
However, using the following code to check them in all_sets (after conversion) gave several mismatched data samples out of 16.
for num, row in enumerate(all_sets[0]):
    if num < 16:
        newimage = tensorflow.keras.preprocessing.image.array_to_img(row)
        newimage.save("filename.png")  # takes type from filename extension
        display(Image('filename.png'))
        print(classes[all_labels[0][num]])
        !rm -rf filename.png
Conversion code in question:
tempSets = [train_set, test_set, val_set]  # store all sets in a list to iterate through
# pandas reads the image from the CSV as a string rather than an array of floats.
# Convert from pandas DataFrame to numpy arrays, as they are easier to feed into image augmentation.
set_sizes = [train_set.shape[0], test_set.shape[0], val_set.shape[0]]  # number of records in each set
all_sets = [np.empty([set_sizes[0], 48, 48, 1], dtype=float),  # train
            np.empty([set_sizes[1], 48, 48, 1], dtype=float),  # test
            np.empty([set_sizes[2], 48, 48, 1], dtype=float)]  # validate
all_labels = [np.empty(set_sizes[0], dtype=int),  # train
              np.empty(set_sizes[1], dtype=int),  # test
              np.empty(set_sizes[2], dtype=int)]  # validate
for count, val in enumerate(tempSets):  # for each set
    for num, row in enumerate(val.itertuples()):  # each row in a set, stored as a tuple with index num
        num_str_list = row._2.split()  # split the long pixel string on whitespace (_2 is the pixels column; itertuples renames it because of the leading space in the header)
        for convertI in range(len(num_str_list)):  # for each element in the string list
            num_str_list[convertI] = float(num_str_list[convertI])  # convert to float and store back in the (1d) list
        for xPixel in range(48):
            for yPixel in range(48):
                # match indexes to map 1d array contents into 2d
                all_sets[count][num, xPixel, yPixel, 0] = num_str_list[xPixel * 48 + yPixel]
        all_labels[count][num] = data['emotion'][num]  # assign labels with matching index
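As an aside, a hypothetical one-line equivalent of the nested per-pixel loops (assuming the same 48x48 grayscale layout and row-major pixel order):
# Hypothetical replacement for the xPixel/yPixel loops above:
all_sets[count][num] = np.array(row._2.split(), dtype=float).reshape(48, 48, 1)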
I'm afraid I'm lost as to where the issue is, as I can't find problems in my conversion or in displaying the images. I would appreciate any help. Thank you.
My full code:
drive.mount('/content/drive', force_remount=False)
driveContent = '/content/drive/My Drive/EmotionClassification'
# Set base_dir as current working directory
base_dir = os.getcwd()
# Check whether training data is present in the current working directory
dataSetIsInColab = False
for fname in os.listdir(base_dir):  # get all names in current directory
    if fname == 'icml_face_data.csv':
        dataSetIsInColab = True
if dataSetIsInColab == False:
    # dataset.zip is a zip file containing a CSV file which holds the dataset.
    zip_path = os.path.join(driveContent, 'dataset.zip')  # grab content from drive
    !cp "{zip_path}" .  # copy zip file to current working directory
    !unzip -q dataset.zip  # unzip (-q prevents printing all files)
    !rm dataset.zip  # remove zip file since we now have unzipped data
data = panda.read_csv('icml_face_data.csv')  # import data as pandas DataFrame
data.pop(' Usage')  # remove unnecessary column
data.rename(columns={" pixels": "pixels"})  # note: rename returns a new DataFrame; without assignment this line has no effect
data[data['emotion'] != 1]  # note: this filter is not assigned back, so the disgust rows are not actually removed
data.loc[data['emotion'] > 1, 'emotion'] = data['emotion'] - 1  # shift emotion values to fill the disgust slot
classes = ['anger', 'fear', 'happiness', 'sadness', 'surprise', 'neutral']
# Use pandas DataFrame methods to randomly split the dataset
train_set = data.sample(frac=0.8)  # train set is 80%
temp_set = data.drop(train_set.index)  # temp set is 20%
test_set = temp_set.sample(frac=0.5)  # test set is half of the 20%
val_set = temp_set.drop(test_set.index)  # val set is the other half
#train_set.reset_index(drop=True, inplace=True)
#test_set.reset_index(drop=True, inplace=True)
#val_set.reset_index(drop=True, inplace=True)
tempSets = [train_set, test_set, val_set]  # store all sets in a list to iterate through
# pandas reads the image from the CSV as a string rather than an array of floats.
# Convert from pandas DataFrame to numpy arrays, as they are easier to feed into image augmentation.
set_sizes = [train_set.shape[0], test_set.shape[0], val_set.shape[0]]  # number of records in each set
all_sets = [np.empty([set_sizes[0], 48, 48, 1], dtype=float),  # train
            np.empty([set_sizes[1], 48, 48, 1], dtype=float),  # test
            np.empty([set_sizes[2], 48, 48, 1], dtype=float)]  # validate
all_labels = [np.empty(set_sizes[0], dtype=int),  # train
              np.empty(set_sizes[1], dtype=int),  # test
              np.empty(set_sizes[2], dtype=int)]  # validate
for count, val in enumerate(tempSets):  # for each set
    for num, row in enumerate(val.itertuples()):  # each row in a set, stored as a tuple with index num
        num_str_list = row._2.split()  # split the long pixel string on whitespace (_2 is the pixels column; itertuples renames it because of the leading space in the header)
        for convertI in range(len(num_str_list)):  # for each element in the string list
            num_str_list[convertI] = float(num_str_list[convertI])  # convert to float and store back in the (1d) list
        for xPixel in range(48):
            for yPixel in range(48):
                # match indexes to map 1d array contents into 2d
                all_sets[count][num, xPixel, yPixel, 0] = num_str_list[xPixel * 48 + yPixel]
        all_labels[count][num] = data['emotion'][num]  # assign labels with matching index
# Check to use GPU if possible
gpu_name = tensorflow.test.gpu_device_name()
if gpu_name != '/device:GPU:0':
    print('GPU device not found')
print('Found GPU at: {}'.format(gpu_name))
# SET UP IMAGE AUGMENTATION AND INPUT
imgShape = (48, 48, 1)  # input image dimensions (greyscale)
batchSize = 64
# this datagen is applied to the training set
dataGen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=45,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    vertical_flip=True)
# Set up data generator batches, shapes and class modes
trainingDataGen = dataGen.flow(
    x=all_sets[0],
    y=tensorflow.keras.utils.to_categorical(all_labels[0]),
    batch_size=batchSize)
testGen = ImageDataGenerator(rescale=1./255).flow(
    x=all_sets[1],
    y=tensorflow.keras.utils.to_categorical(all_labels[1]))
validGen = ImageDataGenerator(rescale=1./255).flow(
    x=all_sets[2],
    y=tensorflow.keras.utils.to_categorical(all_labels[2]),
    batch_size=batchSize)
# BUILD THE MODEL
model = Sequential()
model.add(Conv2D(256, (3, 3), activation='relu', input_shape=imgShape))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.1))
model.add(Conv2D(512, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Dropout(0.1))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(6, activation='softmax'))
model.summary()  # show the structure of the network
opt = tensorflow.keras.optimizers.Adam(learning_rate=0.0001)  # initialise optimizer
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
weights = sklearn.utils.class_weight.compute_class_weight('balanced', np.unique(all_labels[0]), all_labels[0])  # class weights for imbalanced classes
weights = {i: weights[i] for i in range(6)}
history = model.fit(
    trainingDataGen,
    batch_size=batchSize,
    epochs=32,
    class_weight=weights,
    validation_data=validGen
)
EDIT: I found the issue in my conversion: I was assigning labels based on the unsplit data instead of each split set individually.
From all_labels[count][num] = data['emotion'][num] to
all_labels[count][num] = row.emotion
Labels are matching now. Training accuracy/loss steadily improves, but validation accuracy and loss fluctuate a lot. I'm going to do some more tweaking, but is this something I should worry about?
EDIT 2: Made title more fitting
So I left it training for more epochs, and here are the results.
accuracy and loss.
Peaking at 35% isn't ideal.
EDIT 3: It's back to fluctuating wildly.
accuracy and loss.
I used some code to test the predictions of my network after training. Here are the results:
Total tests performed: 4068
Total accuracy: 0.17649950835791545
anger : {'Successes': 0, 'Occured': 567} Accuracy: 0.0
fear : {'Successes': 0, 'Occured': 633} Accuracy: 0.0
happiness : {'Successes': 0, 'Occured': 1023} Accuracy: 0.0
sadness : {'Successes': 0, 'Occured': 685} Accuracy: 0.0
surprise : {'Successes': 0, 'Occured': 442} Accuracy: 0.0
neutral : {'Successes': 718, 'Occured': 718} Accuracy: 1.0
Code below:
# Test our model's performance
truth = []
predict = []
# Test a batch of 128 test images against the model
for i in range(128):
    a, b = next(testGen)
    predict.append(model.predict(a))
    truth.append(b)
predict = np.concatenate(predict)  # array of predictions the model has made
truth = np.concatenate(truth)  # array of true results
successes = 0
items = []
for i in range(len(classes)):
    items.append({"Successes": 0, "Occured": 0})
for i in range(0, len(predict)):
    items[np.argmax(truth[i])]["Occured"] += 1
    if (np.argmax(predict[i]) == np.argmax(truth[i])):  # if the model's prediction matches the true value
        successes += 1
        items[np.argmax(truth[i])]["Successes"] += 1
print("Total tests performed: ", len(predict))
print("Total accuracy: ", successes/len(predict))
print()
for i in range(len(classes)):
    print(classes[i], ':', items[i], "Accuracy: ", (items[i]["Successes"]/items[i]["Occured"]))

Tf.Print() doesn't print the shape of the tensors?

I have written a simple classification program using TensorFlow and I am getting the output, except that when I try to print the shapes of the tensors for the model parameters, features & bias, it doesn't work as expected.
The function definitions:
import tensorflow as tf, numpy as np
from tensorflow.examples.tutorials.mnist import input_data

def get_weights(n_features, n_labels):
    # Return weights
    return tf.Variable(tf.truncated_normal((n_features, n_labels)))

def get_biases(n_labels):
    # Return biases
    return tf.Variable(tf.zeros(n_labels))

def linear(input, w, b):
    # Linear function (xW + b)
    # return np.dot(input, w) + b
    return tf.add(tf.matmul(input, w), b)

def mnist_features_labels(n_labels):
    """Gets the first <n> labels from the MNIST dataset
    """
    mnist_features = []
    mnist_labels = []
    mnist = input_data.read_data_sets('dataset/mnist', one_hot=True)
    # In order to make quizzes run faster, we're only looking at 10000 images
    for mnist_feature, mnist_label in zip(*mnist.train.next_batch(10000)):
        # Add features and labels if it's for the first <n>th labels
        if mnist_label[:n_labels].any():
            mnist_features.append(mnist_feature)
            mnist_labels.append(mnist_label[:n_labels])
    return mnist_features, mnist_labels
The graph creation :
# Number of features (28*28 image is 784 features)
n_features = 784
# Number of labels
n_labels = 3
# Features and Labels
features = tf.placeholder(tf.float32)
labels = tf.placeholder(tf.float32)
# Weights and Biases
w = get_weights(n_features, n_labels)
b = get_biases(n_labels)
# Linear Function xW + b
logits = linear(features, w, b)
# Training data
train_features, train_labels = mnist_features_labels(n_labels)
print("Total {0} data points of Training Data, each having {1} features \n \
Total {2} number of labels,each having 1-hot encoding {3}".format(len(train_features),len(train_features[0]),\
len(train_labels),train_labels[0]
)
)
# global variables initialiser
init= tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
The problem is here:
# shapes = tf.Print(tf.shape(features), [tf.shape(features),
#                                        tf.shape(labels),
#                                        tf.shape(w),
#                                        tf.shape(b),
#                                        tf.shape(logits)
#                                        ], message="The shapes are:")
# print("Verify shapes", shapes)
logits = tf.Print(logits, [tf.shape(features),
                           tf.shape(labels),
                           tf.shape(w),
                           tf.shape(b),
                           tf.shape(logits)],
                  message="The shapes are:")
print(logits)
I looked here, but didn't find much that was useful.
# Softmax
prediction = tf.nn.softmax(logits)
# Cross entropy
# This quantifies how far off the predictions were.
# You'll learn more about this in future lessons.
cross_entropy = -tf.reduce_sum(labels * tf.log(prediction), reduction_indices=1)
# Training loss
# You'll learn more about this in future lessons.
loss = tf.reduce_mean(cross_entropy)
# Rate at which the weights are changed
# You'll learn more about this in future lessons.
learning_rate = 0.08
# Gradient Descent
# This is the method used to train the model
# You'll learn more about this in future lessons.
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
# Run optimizer and get loss (inside the tf.Session block opened above)
_, l = session.run(
    [optimizer, loss],
    feed_dict={features: train_features, labels: train_labels})
# Print loss
print('Loss: {}'.format(l))
The output I am getting is:
Extracting dataset/mnist/train-images-idx3-ubyte.gz
Extracting dataset/mnist/train-labels-idx1-ubyte.gz
Extracting dataset/mnist/t10k-images-idx3-ubyte.gz
Extracting dataset/mnist/t10k-labels-idx1-ubyte.gz
Total 3118 data points of Training Data, each having 784 features
Total 3118 number of labels,each having 1-hot encoding [0. 1. 0.]
Tensor("Print_22:0", shape=(?, 3), dtype=float32)
Loss: 5.339271068572998
Could anyone help me understand why I am not able to see the shapes of the tensors?
That is not how you use tf.Print. It is an op that does nothing on its own (it simply returns its input) but prints the requested tensors as a side effect whenever it is executed. You should do something like
logits = tf.Print(logits, [tf.shape(features),
                           tf.shape(labels),
                           tf.shape(w),
                           tf.shape(b),
                           tf.shape(logits)],
                  message="The shapes are:")
Now, whenever logits is evaluated (as it will be for computing the loss/gradients), the shape information will be printed to standard error.
What you are doing right now is simply printing the return value of the tf.Print op, which is just its input (tf.shape(features)).
After #xdurch0's suggestion, I tried this:
shapes = tf.Print(logits, [tf.shape(features),
                           tf.shape(labels),
                           tf.shape(w),
                           tf.shape(b),
                           tf.shape(logits)],
                  message="The shapes are:")
# Run optimizer and get loss
_, l, resultingShapes = session.run([optimizer, loss, shapes],
                                    feed_dict={features: train_features, labels: train_labels})
print('The shapes are: ', resultingShapes.shape)
and it worked partially:
Extracting dataset/mnist/train-images-idx3-ubyte.gz
Extracting dataset/mnist/train-labels-idx1-ubyte.gz
Extracting dataset/mnist/t10k-images-idx3-ubyte.gz
Extracting dataset/mnist/t10k-labels-idx1-ubyte.gz
Total 3118 data points of Training Data, each having 784 features
Total 3118 number of labels, each having 1-hot encoding [0. 1. 0.]
The shapes are: (3118, 3)
Loss: 10.223002433776855
Could #xdurch0 suggest something to get the desired results?
My DESIRED RESULTS are:
tf.shape(features): (3118, 784), tf.shape(labels): (3118, 3),
tf.shape(w): (784, 3), tf.shape(b): (3, 1), tf.shape(logits): (3118, 3)
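A hedged suggestion (not part of the original thread): since session.run can fetch arbitrary tensors, the individual shapes can also be fetched directly instead of being read off the Print op's return value:
# Fetch the shape tensors themselves, alongside the training ops.
shape_tensors = [tf.shape(features), tf.shape(labels),
                 tf.shape(w), tf.shape(b), tf.shape(logits)]
_, l, shape_values = session.run(
    [optimizer, loss, shape_tensors],
    feed_dict={features: train_features, labels: train_labels})
print('The shapes are:', shape_values)  # e.g. [array([3118, 784]), array([3118, 3]), ...]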

logits and labels must be same size logits_size

Hi, I used my own dataset to train the model, but I get the error mentioned below. My dataset has 124 classes with labels 0 to 123; the images are 60*60 grayscale, the batch size is 10, and the result is:
lables.eval() --> [ 1 101 101 103 103 103 103 100 102 1] -- len(lables.eval())= 10
orginal pic size -- > (?, 60, 60, 1)
First convolutional layer (?, 30, 30, 32)
Second convolutional layer. (?, 15, 15, 64)
flatten. (?, 14400)
dense .1 (?, 2048)
dense .2 (?, 124)
error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [40,124] and labels shape [10]
code
def model_fn(features, labels, mode, params):
# Reference to the tensor named "image" in the input-function.
x = features["image"]
# The convolutional layers expect 4-rank tensors
# but x is a 2-rank tensor, so reshape it.
net = tf.reshape(x, [-1, img_size, img_size, num_channels])
# First convolutional layer.
net = tf.layers.conv2d(inputs=net, name='layer_conv1',
filters=32, kernel_size=3,
padding='same', activation=tf.nn.relu)
net = tf.layers.max_pooling2d(inputs=net, pool_size=2, strides=2)
# Second convolutional layer.
net = tf.layers.conv2d(inputs=net, name='layer_conv2',
filters=64, kernel_size=3,
padding='same', activation=tf.nn.relu)
net = tf.layers.max_pooling2d(inputs=net, pool_size=2, strides=2)
# Flatten to a 2-rank tensor.
net = tf.contrib.layers.flatten(net)
# Eventually this should be replaced with:
# net = tf.layers.flatten(net)
# First fully-connected / dense layer.
# This uses the ReLU activation function.
net = tf.layers.dense(inputs=net, name='layer_fc1',
units=2048, activation=tf.nn.relu)
# Second fully-connected / dense layer.
# This is the last layer so it does not use an activation function.
net = tf.layers.dense(inputs=net, name='layer_fc_2',
units=num_classes)
# Logits output of the neural network.
logits = net
y_pred = tf.nn.softmax(logits=logits)
y_pred_cls = tf.argmax(y_pred, axis=1)
if mode == tf.estimator.ModeKeys.PREDICT:
spec = tf.estimator.EstimatorSpec(mode=mode,
predictions=y_pred_cls)
else:
cross_entropy =
tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels,
logits=logits)
loss = tf.reduce_mean(cross_entropy)
optimizer =
tf.train.AdamOptimizer(learning_rate=params["learning_rate"])
train_op = optimizer.minimize(
loss=loss, global_step=tf.train.get_global_step())
metrics = \
{
"accuracy": tf.metrics.accuracy(labels, y_pred_cls)
}
spec = tf.estimator.EstimatorSpec(
mode=mode,
loss=loss,
train_op=train_op,
eval_metric_ops=metrics)
return spec`
These labels come from here, via TFRecords:
def input_fn(filenames, train, batch_size=10, buffer_size=2048):
    # Args:
    # filenames: Filenames for the TFRecords files.
    # train: Boolean whether training (True) or testing (False).
    # batch_size: Return batches of this size.
    # buffer_size: Read buffers of this size. The random shuffling
    #              is done on the buffer, so it must be big enough.
    # Create a TensorFlow Dataset-object which has functionality
    # for reading and shuffling data from TFRecords files.
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    # Parse the serialized data in the TFRecords files.
    # This returns TensorFlow tensors for the image and labels.
    dataset = dataset.map(parse)
    if train:
        # If training then read a buffer of the given size and
        # randomly shuffle it.
        dataset = dataset.shuffle(buffer_size=buffer_size)
        # Allow infinite reading of the data.
        num_repeat = None
    else:
        # If testing then don't shuffle the data.
        # Only go through the data once.
        num_repeat = 1
    # Repeat the dataset the given number of times.
    dataset = dataset.repeat(num_repeat)
    # Get a batch of data with the given size.
    dataset = dataset.batch(batch_size)
    # Create an iterator for the dataset and the above modifications.
    iterator = dataset.make_one_shot_iterator()
    # Get the next batch of images and labels.
    images_batch, labels_batch = iterator.get_next()
    # The input-function must return a dict wrapping the images.
    x = {'image': images_batch}
    y = labels_batch
    print(x, ' - ', y.get_shape())
    return x, y
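For context, a hypothetical way this input_fn would be wired into an Estimator (train_filenames and the learning rate are assumptions, not from the question):
est = tf.estimator.Estimator(model_fn=model_fn,
                             params={"learning_rate": 1e-4})
# input_fn must be wrapped in a lambda so the Estimator can call it lazily.
est.train(input_fn=lambda: input_fn(train_filenames, train=True), steps=1000)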
I generate the labels via this code; for example, image name = math-1, label = 1:
def get_lable_and_image(path):
    lbl = []
    img = []
    for filename in glob.glob(os.path.join(path, '*.png')):
        img.append(filename)
        lable = filename[41:].split()[0].split('-')[1]
        lbl.append(int(lable))
    lables = np.array(lbl)
    images = np.array(img)
    # print(images[1], lables[1])
    return images, lables
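As a side note, the slice filename[41:] depends on the absolute path length. A hypothetical, path-independent equivalent (assuming names like math-1.png):
# Extract the label from the file name alone, e.g. "math-1.png" -> 1:
lable = os.path.basename(filename).split('.')[0].split('-')[1]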
I push the images and labels into this code to create the TFRecords:
def convert(image_paths, labels, out_path):
    # Args:
    # image_paths: List of file-paths for the images.
    # labels: Class-labels for the images.
    # out_path: File-path for the TFRecords output file.
    print("Converting: " + out_path)
    # Number of images. Used when printing the progress.
    num_images = len(image_paths)
    # Open a TFRecordWriter for the output-file.
    with tf.python_io.TFRecordWriter(out_path) as writer:
        # Iterate over all the image-paths and class-labels.
        for i, (path, label) in enumerate(zip(image_paths, labels)):
            # Print the percentage-progress.
            print_progress(count=i, total=num_images-1)
            # Load the image-file using matplotlib's imread function.
            img = imread(path)
            # Convert the image to raw bytes.
            img_bytes = img.tostring()
            # Create a dict with the data we want to save in the
            # TFRecords file. You can add more relevant data here.
            data = {
                'image': wrap_bytes(img_bytes),
                'label': wrap_int64(label)
            }
            # Wrap the data as TensorFlow Features.
            feature = tf.train.Features(feature=data)
            # Wrap again as a TensorFlow Example.
            example = tf.train.Example(features=feature)
            # Serialize the data.
            serialized = example.SerializeToString()
            # Write the serialized data to the TFRecords file.
            writer.write(serialized)
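One thing worth checking (a hypothetical sketch, since the parse function is not shown in the question): logits shape [40, 124] from a batch of 10 means the reshape produced exactly four times as many rows as labels. This typically happens when tf.decode_raw is given a dtype whose byte size does not match the dtype serialized by img.tostring(); note that matplotlib's imread usually returns float32 (4 bytes per pixel) for PNG files. All names below are assumptions:
def parse(serialized):
    # Feature keys assumed to match the convert() function above.
    features = {'image': tf.FixedLenFeature([], tf.string),
                'label': tf.FixedLenFeature([], tf.int64)}
    parsed = tf.parse_single_example(serialized=serialized, features=features)
    # The dtype here must match what img.tostring() wrote: decoding float32
    # bytes as uint8 yields 4x as many pixels, and reshape([-1, 60, 60, 1])
    # then turns a batch of 10 into 40 "images".
    image = tf.decode_raw(parsed['image'], tf.float32)
    label = tf.cast(parsed['label'], tf.int32)
    return image, label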

Link prediction with input data

I have a list of files, and I use the KNN algorithm to classify these files.
dataset = pd.read_csv(file)
training_samples = get_sample_number(dataset)
X_train = dataset.iloc[:training_samples, 5:9]
y_train = dataset.iloc[:training_samples, 9]
X_test = dataset.iloc[training_samples:, 5:9]
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)  # transform (not fit_transform): reuse the scaling fitted on the training set
# Fitting classifier to the training set
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
Now I have my categories in my y_pred array. But I want to save the result in the file where I read the dataset. How can I link a prediction to the right row in the file (or dataset)?
Your predictions in y_pred have a length of X_test.shape[0], which is obviously less than the length of the original dataset. If you want to attach the predictions to the original dataset that you read from the file, you need to make predictions on the whole dataset (scaled the same way as the training data) and then join them on:
y_pred_all = classifier.predict(sc.transform(dataset.iloc[:, 5:9]))  # scale before predicting
# wrap the numpy array in a Series so pandas can concatenate it as a column
dataset = pd.concat([dataset, pd.Series(y_pred_all, name='prediction', index=dataset.index)], axis=1)
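If the goal is then to write the result back to the file it was read from, something like this should work (a sketch, assuming file still holds the original CSV path):
dataset.to_csv(file, index=False)  # overwrite the source CSV, now including the prediction column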