Train model in batches using fit_generator - tensorflow

My model has 100,000 training samples (images); how do I modify my code below to train it in batches? With model.fit_generator I have to specify the batching inside the generator function:
def data_generator(descriptions, features, n_step, max_sequence):
    # loop until we finish training
    while 1:
        # loop over photo identifiers in the dataset
        for i in range(0, len(descriptions), n_step):
            Ximages, XSeq, y = list(), list(), list()
            for j in range(i, min(len(descriptions), i+n_step)):
                image = features[j]
                # retrieve text input
                desc = descriptions[j]
                # generate input-output pairs
                in_img, in_seq, out_word = preprocess_data([desc], [image], max_sequence)
                for k in range(len(in_img)):
                    Ximages.append(in_img[k])
                    XSeq.append(in_seq[k])
                    y.append(out_word[k])
            # yield this batch of samples to the model
            yield [[array(Ximages), array(XSeq)], array(y)]
My model.fit_generator code:
model.fit_generator(data_generator(texts, train_features, 1, 150),
                    steps_per_epoch=1500, epochs=50, callbacks=callbacks_list, verbose=1)
Any assistance would be great; I'm training in the cloud on a 16 GB Tesla V100.
Edit: My image caption model creates a training sample for each token in the DSL (250 tokens). With a dataset of 50 images (equivalent to 12,500 training samples) and a batch size of 1, I get an OOM. With about 32 images (equivalent to 8,000 samples) and a batch size of 1, it trains just fine. My question is: can I optimize my code better, or is my only option to use multiple GPUs?
Fix:
steps_per_epoch must be equal to ceil(num_samples / batch_size), so with 1500 samples and a batch size of 1, steps_per_epoch should be 1500. I also reduced my LSTM sliding window from 48 to 24.
steps_per_epoch: Integer. Total number of steps (batches of samples)
to yield from generator before declaring one epoch finished and
starting the next epoch. It should typically be equal to
ceil(num_samples / batch_size). Optional for Sequence: if unspecified,
will use the len(generator) as a number of steps.
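For instance, a minimal sketch of that rule (the numbers here are illustrative):

import math

num_samples = 1500  # total samples the generator yields per epoch
batch_size = 1      # samples per yield
steps_per_epoch = math.ceil(num_samples / batch_size)  # -> 1500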

The generators already return batches.
Every yield is a batch. It's totally up to you to design the generator with the batches the way you want.
In your code, the batch size is n_step.

Here's the proper way of using generators: make a generator that yields individual samples, create a Dataset from it, and call the batch method on that object. Then tune the batch size parameter to find the largest value that doesn't cause an OOM.
def data_generator(descriptions, features, max_sequence):
    def _gen():
        for img, seq, word in zip(*preprocess_data(descriptions, features, max_sequence)):
            yield {'image': img, 'seq': seq}, word
    return _gen

ds = tf.data.Dataset.from_generator(
    data_generator(descriptions, features, max_sequence),
    output_types=({'image': tf.float32, 'seq': tf.float32}, tf.int32),
    output_shapes=({
        'image': tf.TensorShape([blah, blah]),  # fill in the actual shapes
        'seq': tf.TensorShape([blah, blah]),
    },
    tf.TensorShape([blah])
    )
)
ds = ds.batch(n_step)
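From there, the batched dataset can be handed straight to the model. A minimal sketch, assuming a recent TF 2.x where Model.fit accepts a tf.data.Dataset and the model's inputs are named 'image' and 'seq' to match the dictionary keys above (on older versions, use tf.data.experimental.AUTOTUNE):

ds = ds.prefetch(tf.data.AUTOTUNE)  # overlap batch preparation with training
model.fit(ds, epochs=50, callbacks=callbacks_list, verbose=1)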

Related

Stateful LSTM Tensorflow Invalid Input_h Shape Error

I am experimenting with a stateful LSTM on a time-series regression problem using TensorFlow. I apologize that I cannot share the dataset.
Below is my code.
train_feature = train_feature.reshape((train_feature.shape[0], 1, train_feature.shape[1]))
val_feature = val_feature.reshape((val_feature.shape[0], 1, val_feature.shape[1]))

batch_size = 64

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(50, batch_input_shape=(batch_size, train_feature.shape[1], train_feature.shape[2]), stateful=True))
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer='adam',
              loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

model.fit(train_feature, train_label,
          epochs=10,
          batch_size=batch_size)
When I run the above code, I get the following error at the end of the first epoch.
InvalidArgumentError: [_Derived_] Invalid input_h shape: [1,64,50] [1,49,50]
[[{{node CudnnRNN}}]]
[[sequential_1/lstm_1/StatefulPartitionedCall]] [Op:__inference_train_function_1152847]
Function call stack:
train_function -> train_function -> train_function
However, the model trains successfully if I change the batch_size to 1 and change the training code to the following.
total_epochs = 10

for i in range(total_epochs):
    model.fit(train_feature, train_label,
              epochs=1,
              validation_data=(val_feature, val_label),
              batch_size=batch_size,
              shuffle=False)
    model.reset_states()
Nevertheless, with very large data (1 million rows), training takes a very long time with a batch_size of 1.
So I wonder: how can I train a stateful LSTM with a batch size larger than 1 (e.g. 64) without getting the invalid input_h shape error?
Thanks for your answers.
The fix is to ensure the batch size never changes between batches: they must all be the same size.
Method 1
One way is to use a batch size that divides your dataset evenly into equal-sized batches. For example, if the total size of the data is 1500 examples, use a batch size of 50 or 100 or some other proper divisor of 1500.
batch_size = len(data) // proper_divisor  # integer division, so batch_size stays an int
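As a quick illustration, the candidate batch sizes can be enumerated like this (a sketch; n stands for your dataset size):

n = 1500  # dataset size
proper_divisors = [d for d in range(1, n + 1) if n % d == 0]
# -> [1, 2, 3, 4, 5, 6, 10, 12, 15, 20, 25, 30, 50, 60, 75, 100, ...]
# pick the largest one that still fits in memory as your batch size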
Method 2
The other way is to drop any batch that is smaller than the specified size; this can be done with the TensorFlow Dataset API, by setting drop_remainder to True.
batch_size = 64
train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)
steps_per_epoch = len(train_feature) // batch_size
model.fit(train_data,
          epochs=10, steps_per_epoch=steps_per_epoch)
When using the Dataset API like this, you also need to specify how many rounds of training count as one epoch (essentially, how many batches make up an epoch). A tf.data.Dataset instance (the result of tf.data.Dataset.from_tensor_slices) doesn't know the size of the data it's streaming to the model, so what constitutes one epoch has to be specified manually with steps_per_epoch.
Your new code will look like this:
train_feature = train_feature.reshape((train_feature.shape[0], 1, train_feature.shape[1]))
val_feature = val_feature.reshape((val_feature.shape[0], 1, val_feature.shape[1]))

batch_size = 64

train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(50, batch_input_shape=(batch_size, train_feature.shape[1], train_feature.shape[2]), stateful=True))
model.add(tf.keras.layers.Dense(1))

model.compile(optimizer='adam',
              loss='mse',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

steps_per_epoch = len(train_feature) // batch_size
model.fit(train_data,
          epochs=10, steps_per_epoch=steps_per_epoch)
You can also include the validation set, like this (other code not shown):
batch_size = 64

val_data = tf.data.Dataset.from_tensor_slices((val_feature, val_label))
val_data = val_data.repeat().batch(batch_size, drop_remainder=True)
validation_steps = len(val_feature) // batch_size

model.fit(train_data, epochs=10,
          steps_per_epoch=steps_per_epoch,
          validation_data=val_data,
          validation_steps=validation_steps)
Caveat: This means a few datapoints will never be seen by the model. To get around that, you can shuffle the dataset each round of training, so that the datapoints left behind change from epoch to epoch, giving every datapoint a chance to be seen by the model.
buffer_size = 1000 # the bigger the slower but more effective shuffling.
train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.shuffle(buffer_size=buffer_size, reshuffle_each_iteration=True)
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)
Why the error occurs
Stateful RNNs and their variants (LSTM, GRU, etc.) require a fixed batch size. The reason is simply that statefulness is one way to realize truncated backpropagation through time: the final hidden state of one batch is passed in as the initial hidden state of the next batch. That final hidden state must have exactly the same shape as the initial state it becomes, which requires the batch size to stay constant across batches.
When you set the batch size to 64, model.fit will use the remaining data at the end of an epoch as a batch, and it may not contain a full 64 datapoints. So you get the error because that batch size differs from what the stateful LSTM expects. You don't have the problem with a batch size of 1, because any remaining data at the end of an epoch always contains exactly 1 datapoint. More generally, 1 divides any integer, so if you pick any other divisor of your data size, you should not get the error.
In the error message you posted, it appears the last batch has a size of 49 instead of 64. On a side note: the reason the shapes look different from the input is that, under the hood, keras works with the tensors in time-major layout (i.e., the first axis is for the steps of the sequence). When you pass a tensor of shape (10, 15, 2) representing (batch_size, steps_per_sequence, num_features), keras reshapes it to (15, 10, 2) under the hood.
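To make the state carrying concrete, here is a minimal sketch (the shapes are illustrative, not taken from the question):

import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 4, 1, 3
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(50, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features))
])

x = np.random.rand(batch_size, timesteps, features).astype('float32')
model.predict(x)      # the final hidden state of this batch is stored...
model.predict(x)      # ...and used as the initial state of this call
model.reset_states()  # clears the carried state

# A batch whose first dimension is not 4 cannot be matched against the
# stored state of shape (4, 50), which is exactly the error above.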

TensorFlow DataSet shuffle - data shuffling only starting from second epoch

I am using the TensorFlow Dataset for my input data pipeline. I am wondering how to run training without data shuffling in the first epoch and then start shuffling the data from the second epoch on.
The graph is usually built before iterative training starts, and during training it doesn't seem straightforward to change the Dataset's shuffling behavior, since that looks to me like changing the graph.
Any idea?
Thanks,
Harry
The buffer_size argument to Dataset.shuffle() can be a computed tf.Tensor, so you can use the following code, which uses Dataset.range(NUM_EPOCHS).flat_map(...) to transform a sequence of epoch numbers into the (shuffled or otherwise) elements of a per_epoch_dataset:
NUM_EPOCHS = ...        # The total number of epochs.
BUFFER_SIZE = ...       # The shuffle buffer size to use from the second epoch on.
per_epoch_dataset = ... # A `Dataset` representing the elements of a single epoch.

def shuffle_after_first_epoch(epoch):
    # Set `epoch_buffer_size` to 1 (i.e. no shuffling) in the 0th epoch,
    # and `BUFFER_SIZE` thereafter.
    epoch_buffer_size = tf.cond(tf.equal(epoch, 0),
                                lambda: tf.constant(1, tf.int64),
                                lambda: tf.constant(BUFFER_SIZE, tf.int64))
    return per_epoch_dataset.shuffle(epoch_buffer_size)

dataset = tf.data.Dataset.range(NUM_EPOCHS).flat_map(shuffle_after_first_epoch)
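A toy instantiation to sanity-check the behavior (a sketch assuming a TF 2.x eager environment; the values are illustrative):

BUFFER_SIZE = 10
per_epoch_dataset = tf.data.Dataset.range(5)

dataset = tf.data.Dataset.range(3).flat_map(shuffle_after_first_epoch)
print(list(dataset.as_numpy_iterator()))
# The first five elements come out in order (buffer size 1);
# the two later epochs come out shuffled.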

Keras fit_generator using a lot of memory even with small batch sizes

Previously I manually trained my model using model.fit() inside a for loop, on small batches of data, due to memory constraints. The problem with this is that I can't access all the previous histories through history.history, because it's as if a new model is trained each time, and the previous histories aren't stored anywhere.
When I use model.fit() with a batch size of 500, around 7 GB of my RAM fills up. I use keras with the tensorflow-cpu backend.
But when I use a generator, even a batch size of 50 won't fit in memory, and gets swapped onto the disk.
I'm performing classification on 224x224 images, and I am trying to fine-tune VGG Face. I'm using VGG Face as implemented at this link:
VGG-Face
I'm using the ResNet and SeNet architectures, as described in the link.
I've previously shuffled my data, and put aside 20% of it for testing.
My data (image addresses and labels) is stored in a list. 20% of the training data is used for validation. For example, if the batch size is 50, train_data_generator will create a batch of size 40 from the first 80% of the training data, and val_data_generator will create a batch of size 10 from the last 20%. I've written a class, and I perform training by creating an instance and invoking its train method. Here are the generator and training parts of my code, excluding the model definitions:
def prepare_input_data(self, batch_addresses):
    image = []
    for j in range(len(batch_addresses)):
        img = cv2.imread(batch_addresses[j])
        img = cv2.resize(img, (224, 224))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img - np.array([103.939, 116.779, 123.68])
        image.append(img)
    data = np.array(image)
    data = data.astype('float32')
    data /= 255
    return data
def train_data_generator(self, addresses, labels, batch_size):
    """Train data generator"""
    # Use the first 80% of the data for training.
    addresses = addresses[: int(0.8 * len(addresses))]
    labels = labels[: int(0.8 * len(labels))]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):  # integer division for Python 3
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels
def val_data_generator(self, addresses, labels, batch_size):
    """Validation data generator"""
    # Use the last 20% of the data for validation.
    addresses = addresses[int(0.8 * len(addresses)):]
    labels = labels[int(0.8 * len(labels)):]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):  # integer division for Python 3
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels
def train(self, label_interested_in):
    """Trains the model"""
    # Read training data from a json file, and get addresses and labels.
    addresses, labels = self.create_address_and_label(label_interested_in)
    batch_size = 50
    train_batch_size = 40
    val_batch_size = 10
    steps = int(len(addresses) / batch_size) + 1
    print(len(addresses), steps)
    # Perform training.
    history = self.custom_vgg_model.fit_generator(
        self.train_data_generator(addresses, labels, train_batch_size),
        steps_per_epoch=steps, epochs=self.number_of_epochs,
        verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
        validation_steps=steps, initial_epoch=0)
Why am I seeing such high memory usage? Is it because of the way generators work in keras? I read that generators prepare batches beforehand to speed up training by running in parallel with it. Or am I doing something wrong?
As a side question, since there isn't a batch_size argument in fit_generator(), am I correct in assuming that data gets loaded into the model batch by batch from the generators, and that gradient updates are performed after each training and validation batch is loaded?
Try workers=0.
This will not invoke any multiprocessing; the worker mechanism is what fills up a queue beforehand, up to the max_queue_size argument, using k workers.
What that does is prepare a queue of generated data on the CPU while training is ongoing on the GPU, so no time is lost and bottlenecks are avoided.
For your needs, workers=0 will work.
For a deeper inquiry, refer to the keras fit_generator documentation.
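Applied to the train method from the question, the change is just the extra keyword arguments (a sketch; everything else unchanged):

history = self.custom_vgg_model.fit_generator(
    self.train_data_generator(addresses, labels, train_batch_size),
    steps_per_epoch=steps, epochs=self.number_of_epochs,
    verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
    validation_steps=steps, initial_epoch=0,
    workers=0, max_queue_size=10)  # run the generator on the main thread, no prefetch workers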

Tensorflow: how training data is processed if batch_size x train_steps is greater than number of records?

Let's say I have a training set of 1000 lines, batch_size is 100, and train_steps is 12. As I understand it, in this case the CSV will be read 12 times, 100 lines each time? But there are only 1000 lines, not 1200. Yet from my small experience I can set train_steps to any number like 100000 and training passes fine. So does it mean that the CSV is read from the beginning again every 10th step?
def generate_input_fn(filenames,
                      num_epochs=None,
                      shuffle=True,
                      skip_header_lines=0,
                      batch_size=200):
    filename_queue = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=shuffle)
    reader = tf.TextLineReader(skip_header_lines=skip_header_lines)
    _, rows = reader.read_up_to(filename_queue, num_records=batch_size)
    features = parse_csv(rows)

    # shuffle=False
    features = tf.train.batch(
        features,
        batch_size,
        capacity=batch_size * 10,
        num_threads=multiprocessing.cpu_count(),
        enqueue_many=True,
        allow_smaller_final_batch=True
    )

    return features, parse_label_column(features.pop(LABEL_COLUMN))
It's based on the census example: https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census
At each epoch you'll start reading your CSV files from the start again, but they will be re-shuffled between epochs by tf.train.string_input_producer (note that the order of the files is shuffled, not their content). Although it would be better to have a dataset big enough that you never need the same data twice, it's actually very rare to have such a big training set, so using the same data several (or even many) times is usual and still works well (up to a certain point).
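If you instead want reading to stop after an exact number of passes, num_epochs is the knob (a sketch against the TF 1.x queue API used in the question):

# Each filename is produced at most 12 times; the 13th pass raises
# tf.errors.OutOfRangeError, which ends the input loop. Note that setting
# num_epochs creates a local counter variable, so you must also run
# tf.local_variables_initializer() before starting the queue runners.
filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=12, shuffle=True)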

rationale behind the evaluation in tensorflow's tutorial code cifar10_eval.py

In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
    with tf.Graph().as_default() as g:
        # Get images and labels for CIFAR-10.
        eval_data = FLAGS.eval_data == 'test'
        images, labels = cifar10.inputs(eval_data=eval_data)

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate predictions.
        top_k_op = tf.nn.in_top_k(logits, labels, 1)

        # Restore the moving average version of the learned variables for eval.
        variable_averages = tf.train.ExponentialMovingAverage(
            cifar10.MOVING_AVERAGE_DECAY)
        variables_to_restore = variable_averages.variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)

        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.summary.merge_all()
        summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)

        while True:
            eval_once(saver, summary_writer, top_k_op, summary_op)
            if FLAGS.run_once:
                break
            time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints 'precision' to the console every eval_interval_secs seconds. My questions are:
1. Each time eval_once() is executed, one batch of samples (128) is dequeued from the data queue, but why didn't I see the evaluation stop after enough batches, 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
2. Are the batches from the first 79 rounds of sampling mutually exclusive? I'd assume so, but want to double-check.
3. If each batch is indeed dequeued from the data queue, what are the samples after the 79th round of sampling? Some random sampling from the whole duplicated data queue again?
4. Since in_top_k() takes unnormalized logit values and outputs a string of booleans, this masks the internal conversion of softmax() plus thresholding. Is there a TF op for such explicit computations? Ideally, it would be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in the inputs() definition of cifar10_input.py:
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer:
string_input_producer(
    string_tensor,
    num_epochs=None,
    shuffle=True,
    seed=None,
    capacity=32,
    shared_name=None,
    name=None,
    cancel_op=None
)
num_epochs: produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
1. In our case, num_epochs is not specified. That's why it does not stop after a few batches; it can cycle an unlimited number of times.
2. By default, the shuffle option is set to True in tf.train.string_input_producer, so it shuffles the data first and then cycles through that shuffled set of 10K filenames again and again. Therefore the batches within one pass are mutually exclusive. You can print the filenames to see this.
3. As explained in 1, they are repeated samples (not newly drawn random data).
4. You can avoid using tf.nn.in_top_k: use tf.nn.softmax and tf.greater_equal to obtain a boolean tensor that is True where the softmax value is above a specific threshold.
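A minimal sketch of that replacement, in the same TF 1.x style as the tutorial (the 0.5 threshold is illustrative and tunable):

probs = tf.nn.softmax(logits)                   # (batch_size, num_classes) probabilities
above_threshold = tf.greater_equal(probs, 0.5)  # boolean tensor, True where prob >= threshold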
I hope this helps. Please comment if there is any misunderstanding.