How to correctly split MNIST dataset into training and validation set? - tensorflow

I have some really simple code that takes the training data from MNIST, chooses the last 10,000 examples as the validation set, and then deletes those last 10,000 examples from the training set.
import tensorflow as tf
import tensorflow_datasets as tfds
(X_train, Y_train), _ = tf.keras.datasets.mnist.load_data()
X_valid = X_train[-10000:]
Y_valid = Y_train[-10000:]
X_train = X_train[:-10000]
Y_train = Y_train[:-10000]
However, this is very dumb in my opinion and I would like to make the data splitting procedure more sophisticated in the following ways:
I should be able to specify what percentage of the data I want as the validation set, instead of just taking the last however many samples.
I need a way to make sure that the data is balanced after I partition it into training and validation. Grabbing a contiguous chunk could leave some digits with very few training examples.
Surprisingly, I went through almost every Tensorflow tutorial and none of them does any validation (except for https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch, which uses the same dumb data splitting methodology as above). Most examples just directly split the data into train and test, which we almost never do in real life.
Could someone please advise?

The keras.datasets.mnist module loads Yann LeCun's MNIST dataset (see the documentation).
The dataset is set up so that it contains 60,000 training examples and 10,000 test examples.
Since load_data() just returns NumPy arrays, you can easily concatenate the train and test arrays into a single array, after which you can play with the new array as you like.
If you want a validation set out of the training set, then you can shuffle the training set first and then extract the validation set.
All these operations will be simple Numpy Array operations and wouldn't even require any Tensorflow functionality.
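To address both points in the question (a configurable validation percentage and per-digit balance), a stratified split can be done with plain NumPy. A minimal sketch, where the 20% fraction and the seed are purely illustrative:
import numpy as np
import tensorflow as tf

(X, Y), _ = tf.keras.datasets.mnist.load_data()

val_fraction = 0.2              # illustrative: 20% of each digit goes to validation
rng = np.random.RandomState(0)  # illustrative seed for reproducibility

train_idx, valid_idx = [], []
for digit in np.unique(Y):
    idx = np.where(Y == digit)[0]   # all examples of this digit
    rng.shuffle(idx)
    n_valid = int(len(idx) * val_fraction)
    valid_idx.extend(idx[:n_valid])
    train_idx.extend(idx[n_valid:])

X_train, Y_train = X[train_idx], Y[train_idx]
X_valid, Y_valid = X[valid_idx], Y[valid_idx]
If scikit-learn is available, sklearn.model_selection.train_test_split(X, Y, test_size=0.2, stratify=Y) achieves the same thing in one call.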

Related

Defining assignment function as variable in tensorflow?

I am training a neural network with SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated from the inputs. In other words, the data does not have to be realistic, but the relationship between inputs and labels is specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that TensorFlow only randomizes when I define assign_training_input_with_random_values, and that the same training examples are then fed to the NN in every batch afterwards. In that case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it randomizes on every batch. Is this actually the case, or is there another reason for the difference?
First, the explanation for your observations
Computational difference between the 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then execute it on every batch. In the second solution, however, you create a new op on every iteration, growing the computational graph over time, which causes your program to slow down.
Observation about the 1st solution
(After @Y.Z.'s finding) Apparently the first solution does evaluate to different random arrays every time you run it. Therefore, the first solution is also valid.
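A quick way to confirm this is to run the same assign op twice and compare the results. A minimal TF 1.x sketch with an illustrative shape:
import tensorflow as tf

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(assign_op))  # first draw
    print(sess.run(assign_op))  # a different draw, confirming re-randomization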
Another way to implement this
The correct way to implement your solution would be to use a tf.placeholder and feed in new values every epoch, as follows.
import tensorflow as tf
import numpy as np

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)

# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values,
                 feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and for my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of tensors remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune the old tf.random tensors that are no longer used.
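For completeness, here is a small sketch of that check (TF 1.x), using the reusable assign op from the first solution; the shape is illustrative, and the node count printed inside the loop stays constant:
import tensorflow as tf

tf.reset_default_graph()
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(5):
        sess.run(assign_op)
        # Node count does not grow across iterations.
        print(len(tf.get_default_graph().as_graph_def().node))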

The established way to use the TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, but this iterator is only good for one epoch

Edit:
To clarify why this question is different from the suggested duplicates: this SO question follows up on those duplicates, asking what exactly Keras is doing with the techniques described in them. The suggested duplicates specify using the Dataset API's make_one_shot_iterator() in model.fit; my follow-up is that make_one_shot_iterator() can only go through the dataset once, yet the solutions given specify several epochs.
This is a follow-up to these SO questions:
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train, is_training=True)

model = # your keras model here
model.fit(
    training_set.make_one_shot_iterator(),
    steps_per_epoch=len(x_train) // 128,
    epochs=5,
    verbose=1)
However, according to the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets):
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for 1 epoch. However, the code in the SO questions specifies several epochs, with the example above specifying 5.
Is there any explanation for this contradiction? Does Keras somehow know that when the one-shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass the dataset object to model.fit; Keras will handle the iteration.
Consider one of the pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create a dataset object from the training data of the cifar10 dataset. In this case a parse function isn't needed.
If you create a dataset from paths to images or from a list of numpy arrays, you'll need one:
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In that case you need a function to load the actual data from the filename. A numpy array can be handled the same way, just without tf.read_file:
def parse_func(filename):
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    label = ...  # get label from filename
    return image, label
Then you can shuffle, batch, and map any parse function onto this dataset. You control how many examples are preloaded with the shuffle buffer size. repeat() controls the epoch count and its count argument is better left as None, so the dataset repeats indefinitely. You can use either the plain batch function or combine map and batch:
dataset = dataset.shuffle(buffer_size).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_func, batch_size=batch_size, num_parallel_batches=num_parallel_batches))
Then the dataset object can be passed to model.fit:
model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case; it defines when to start a new epoch, so you'll have to know the epoch size in advance.
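To make that concrete, here is a hedged end-to-end sketch, assuming TensorFlow 1.9+ with tf.keras (it also runs on TF 2.x); the model architecture, batch size of 128, and buffer size are illustrative, not taken from the posts above:
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype('float32') / 255.0

dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
dataset = dataset.shuffle(buffer_size=10000).repeat().batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Because the dataset repeats indefinitely, steps_per_epoch is what ends each epoch.
model.fit(dataset, epochs=5, steps_per_epoch=len(x_train) // 128)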

Using Tensorflow Datasets and Estimators with More Data than RAM

I've recently switched my modeling framework to use custom Tensorflow Estimators and Datasets, and am quite happy overall with this workflow.
However, I've just noticed an issue with how my dataset_input_fn loads data from tfrecords. My input function is modeled after the example in the Tensorflow documentation. The issue arises when I have more examples than I can fit into RAM. If I have 1e6 examples and set my shuffle buffer_size to 1e5, a subset of 1e5 examples is selected once, shuffled, and then iterated on. This means my model is only trained on 10% of my overall dataset. My code that sets up this behavior is borrowed exactly from the Tensorflow documentation example code:
dataset = dataset.map(parser)
dataset = dataset.shuffle(buffer_size=10000)
dataset = dataset.batch(32)
dataset = dataset.repeat(num_epochs)
iterator = dataset.make_one_shot_iterator()
My question: is it possible to fill the shuffle buffer with new examples outside of the initial 1e5 as I train? Is this type of functionality supported with a one_shot_iterator? Do I need to use an initializable iterator?
Thanks!
I have found what appears to be a tenable workaround for now. Through some experimentation, I learned that when instantiating a TFRecordDataset,
filenames = ["file1.tfrecord", ..., "filen.tfrecord"]
dataset = tf.data.TFRecordDataset(filenames)
and setting up a shuffle buffer:
dataset = dataset.shuffle(buffer_size=10000)
the buffer is only populated with the first 10,000 examples, drawn from however many tfrecords that requires. For example, in my case I have ~300 tfrecord files containing 4096 examples each. On examination, my shuffle buffer appears to consist only of examples from the first 3 tfrecords in my filenames list. Since my filenames list is static, this means that my model is only trained on my first 3 tfrecords!
My fix for now is pretty simple. In my training loop I already alternate between Estimator.train and Estimator.evaluate, and I noticed that each time I call Estimator.train, the shuffle buffer is repopulated. My solution is therefore to shuffle my filenames each time my input_fn is called. This is not a very elegant solution, but it does achieve the desired effect of allowing me to iterate across all tfrecords.
#My Crappy Fix: shuffle file names in input_fn
np.random.shuffle(filenames)
dataset = tf.data.TFRecordDataset(filenames)
What's annoying about this solution (aside from its kludginess) is that my minibatches are not "globally random". Rather, they are selected from a small subset of tfrecords, and only that subset is used for each training/evaluation cycle. One way to mitigate this is to increase my shuffle buffer size or decrease my tfrecord size; I'll probably do both. Finally, I think it's worth noting that if
shuffle_buffer_size < (tf_record_size + minibatch_size)
then, as far as I can tell, my TFRecordDataset will pull from a single tfrecord file!
Finally, I don't think the relevant tensorflow documentation conveys these complexities well. The documentation alludes to the ability to train on large datasets that don't fit into memory, but doesn't provide much detail. It seems unlikely that the tf authors had in mind my hacky strategy when writing this, so I remain curious to see if there's a better approach.
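For reference, here is a hedged sketch of that idea as a self-contained input_fn, combined with a file-level interleave so the shuffle buffer is not filled from just the first few files. The cycle_length, buffer size, and the parser function are assumptions based on the snippets above, not code from the original post:
import numpy as np
import tensorflow as tf

def dataset_input_fn(filenames, batch_size=32, num_epochs=1):
    # Shuffle the file list on every call, so each training cycle starts
    # from a different subset of tfrecord files.
    filenames = list(filenames)
    np.random.shuffle(filenames)

    files = tf.data.Dataset.from_tensor_slices(filenames)
    # Interleave reads from several tfrecord files at once, so the shuffle
    # buffer is not filled from only the first few files in the list.
    dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=8)

    dataset = dataset.map(parser)            # parser: the user's parse function
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    return dataset.make_one_shot_iterator().get_next()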

Questions about tensorflow GetStarted tutorial

So I was reading the tensorflow getting-started tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why it is necessary (or not).
1. In the tf.estimator section, what's the meaning of the "x_eval" and "y_eval" arrays, and what are they supposed to be? The x_train and y_train arrays give the desired output (the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necessary? How did they choose those values?
2. The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is that one has
num_epochs=None, shuffle=True
and the other has
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are or do, what the difference between the two is, or whether both are necessary.
3. In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
4. The final question is, how do I get W and b? In the previous way of doing it (not using tf.estimator) they explicitly found that W=-1 and b=1. If I were building a more complex neural network, involving biases and weights, I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using TensorFlow: to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getting started" tutorial. I personally think it leaves a lot to be desired, since it's very unclear what each thing does and at best you can only guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimize the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = - x_eval + 1
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple passes over the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need.
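To make the distinction concrete, here is a hedged sketch of the two input functions using the TF 1.x helper tf.estimator.inputs.numpy_input_fn and the toy arrays from the tutorial (treat the exact values and batch size as illustrative):
import numpy as np
import tensorflow as tf

x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7., 0.])

# Training: shuffle and repeat indefinitely; estimator.train(steps=...) decides when to stop.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)

# Evaluation: a single, ordered pass over the validation data is enough to compute metrics.
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1, shuffle=False)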
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
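Going back to the fourth question, here is a minimal sketch of recovering W and b from the trained estimator in the tutorial. The variable names shown are typical for a LinearRegressor with a single feature column named "x", but they depend on the estimator and TF version, so inspect the printed list first:
# List the variables the trained estimator holds.
print(estimator.get_variable_names())

# Assumed/typical names; use whatever names the list above actually shows.
W = estimator.get_variable_value('linear/linear_model/x/weights')
b = estimator.get_variable_value('linear/linear_model/bias_weights')
print(W, b)  # should come out close to -1 and 1 for the tutorial data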
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.

tf.contrib.learn.LinearRegressor builds unexpectedly bad model for a data with one feature

I am building a simple linear regressor for data from a CSV file. The data includes the weight and height values of some people. The overall learning process is very simple:
MAX_STEPS = 2000
# ...
features = [tf.contrib.layers.real_valued_column(feature_name) for feature_name in FEATURES_COL]
# ...
linear_regressor = tf.contrib.learn.LinearRegressor(feature_columns=features)
linear_regressor.fit(input_fn=prepare_input, max_steps=MAX_STEPS)
However, the model built by the regressor is unexpectedly bad. The result can be illustrated with the following picture:
Visualization code (just in case):
plt.plot(height_and_weight_df_filtered[WEIGHT_COL],
         linear_regressor.predict(input_fn=prepare_full_input),
         color='blue',
         linewidth=3)
Here is the same data given to the LinearRegression class from scikit-learn:
lr_updated = linear_model.LinearRegression()
lr_updated.fit(weight_filtered_reshaped, height_filtered)
And the visualization:
Increasing the number of steps has no effect. I assume I'm using the TensorFlow regressor in the wrong way.
iPython notebook with the code.
It looks like your TF model does indeed work and will get there with enough steps. You need to jack the step count right up, though: 200K steps showed significant improvement, almost as good as the sklearn default.
I think there are two issues:
sklearn looks like it simply solves the equation using ordinary least squares. TF's LinearRegressor uses the FtrlOptimizer. The paper indicates it is a better choice for very large datasets.
The input_fn to the model is injecting the whole training set at once, for every step. This is just a hunch, but I suspect that the FtrlOptimizer may do better if it sees batches at a time.
Instead of just increasing the number of steps by a couple of orders of magnitude, you can also jack up the learning rate on the optimizer (the default is 0.2) and get similarly good results from only 4k steps:
linear_regressor = tf.contrib.learn.LinearRegressor(
    feature_columns=features,
    optimizer=tf.train.FtrlOptimizer(learning_rate=5.0))
I met a similar problem. The solution is to check whether your input_fn runs for enough epochs. The training may not converge before iterating over the whole training data several times.
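As a hedged sketch of that check: you can build the input_fn with the TF 1.x helper tf.estimator.inputs.numpy_input_fn and let it repeat indefinitely, so max_steps (rather than a single pass over the data) decides when training stops. The array names "weights" and "heights", the column name, and the batch size are assumptions, not taken from the original notebook:
import tensorflow as tf

# weights and heights: NumPy arrays read from the CSV (assumed names).
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"weight": weights}, y=heights,
    batch_size=64,
    num_epochs=None,   # repeat indefinitely; max_steps stops training
    shuffle=True)

linear_regressor.fit(input_fn=input_fn, max_steps=MAX_STEPS)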