We have the following input pipeline:
with tf.name_scope('input'):
filename_queue = tf.train.string_input_producer(
[filename], num_epochs=num_epochs)
# Even when reading in multiple threads, share the filename
# queue.
image, label = read_and_decode(filename_queue)
# Shuffle the examples and collect them into batch_size batches.
# (Internally uses a RandomShuffleQueue.)
# We run this in two threads to avoid being a bottleneck.
images, sparse_labels = tf.train.shuffle_batch(
[image, label], batch_size=batch_size, num_threads=2,
capacity=1000 + 3 * batch_size,
# Ensures a minimum amount of shuffling of examples.
min_after_dequeue=1000)
return images, sparse_labels
and we have the following training:
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
step = 0
while not coord.should_stop():
start_time = time.time()
# Run one step of the model. The return values are
# the activations from the `train_op` (which is
# discarded) and the `loss` op. To inspect the values
# of your ops or variables, you may include them in
# the list passed to sess.run() and the value tensors
# will be returned in the tuple from the call.
_, loss_value = sess.run([train_op, loss])
duration = time.time() - start_time
# Print an overview fairly often.
if step % 100 == 0:
print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value,
duration))
step += 1
except tf.errors.OutOfRangeError:
print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
finally:
# When done, ask the threads to stop.
coord.request_stop()
# Wait for threads to finish.
coord.join(threads)
sess.close()
I have two doubts:
1) Is the variable num_epochs deciding the number of training iterations?
2) My model is pretty large and i want to checkpoint and restore and train.
How do I know for a restored model how many iterations are done and how many are left?
1) as stated in the tensorflow api tf.train.string_input_producer will throw a tf.errors.OutOfRangeError once each string has been produced for num_epoch times. So yes, num_epochs will be deciding the number of training iterations in your code.
2) I think it might be possible to declare a tf.Variable and increase its value for each epoch that you run, so when you restore your model you could read that value again and train for the remaining epochs. Unfortunately i dont know if there is a smarter way, since most people only save their models for predictions after the training, or do finetuning for a fix number of epochs.
Hope i could help
Related
I am not sure since when am having this issue and I have to believe that this happened at some point between today and a few months ago but it would seem that the RAM (CPU) consumption grows over time during epochs.
self.model.fit(
train_data,
initial_epoch=self.status.valid_last.epoch,
epochs=train_config.epochs,
steps_per_epoch=train_config.steps_per_epoch,
callbacks=self._get_experiment_callbacks(),
validation_data=valid_data,
validation_steps=train_config.validation_steps,
)
The only thing out of the ordinary here might be the callbacks I am passing but there's actually nothing special here. One is a TensorBoard (TB) callback and the other is a custom Metric which is not doing much except plotting the learning rate and other general metrics to TB.
def _get_experiment_callbacks(self) -> List[tf.keras.callbacks.Callback]:
tensorboard_cb = tf.keras.callbacks.TensorBoard(
log_dir=os.path.join(out_dir, "logs"),
update_freq="epoch",
profile_batch=profile_batch,
write_images=True,
)
# Not interested in whatever is plotted in those
tensorboard_cb.on_epoch_end = lambda *args: ...
tensorboard_cb.on_test_end = lambda *args: ...
return [
tensorboard_cb,
Metrics(tensorboard_cb, update_freq=100),
]
This leaves us with the last suspect which is the valid_data itself. This is essentially just a list of protobuf files (shards) which I am loading like so:
def load_shards(
decode_example_fn: Callable,
shard_fps: List[str],
training: bool,
buffer_size: int = None # 50 * 1000 ** 2,
) -> tf.data.Dataset:
if not len(shard_fps) > 0:
raise ValueError("Argument shard_fps must be a list to shards but is empty.")
def make_dense_(example):
for k, v in example.items():
if isinstance(v, tf.SparseTensor):
example[k] = tf.sparse.to_dense(v)
return example
def load_records_(filenames):
record_dataset = tf.data.TFRecordDataset(filenames, buffer_size=buffer_size)
record_dataset = record_dataset.map(decode_example_fn)
record_dataset = record_dataset.map(make_dense_)
return record_dataset
if not training:
shard_fps = sorted(shard_fps)
dataset = tf.data.Dataset.from_tensor_slices(tf.constant(shard_fps))
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = dataset.with_options(options)
if training:
dataset = dataset.interleave(load_records_, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
else:
dataset = dataset.apply(load_records_)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
return dataset
and from then on there's just preprocessing and transformation mappings on the inputs. So.. I would not expect any memory leak at this point
Still, I am observing a continuous increase of memory consumption over time. The screenshot below shows the consumption after a restart.
At first we use ~28GB of RAM. After 100 steps there's a sharp increase, to ~33GB and from there it kind of seems to stabilize at around 38GB. The next big jump at 216k steps is coming from an evaluation. From there it's just constantly growing ..
From the looks it appears as if the memory usage stabilized and the jump only occurs after each epoch (1 epoch = 6000 steps).
There could be any number of things that could be wrong. TensorBoard could possibly not be reusing the same graph, but instead is adding graphs, which leads to OOM. I don't use TensorBoard myself because I remember this as happening to me a few years back. It's also possible that using model.fit is the problem and that you're loading your data at every epoch. You could try writing the training loop something like:
for epoch in tf.range(epochs):
batch_train_loss = []
batch_train_acc = []
for batch, (X, Y) in train_dataset.enumerate():
train_loss = train_fn(X, Y, model, loss, optimizer, metric, batch) # do the actual training
train_acc = metric.result().numpy() # get the training accuracy
batch_train_loss.append(train_loss) # save the training loss above
batch_train_acc.append(train_acc) # save the training accuracy above
metric.reset_states() # reset the metric after every batch
where the train_fn is:
def get_apply_train_fn():
#tf.function
def train_function(X, Y, model, loss, optimizer, metric, step):
with tf.GradientTape() as tape:
predictions = model(X, training=True)
loss_value = loss(Y, predictions)
gradients = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train_acc = metric.update_state(Y, predictions)
return loss_value
return train_function
train_fn = get_apply_train_fn()
Now, this is a stupidly complicated way of writing model.fit, but it does work.
Another way in which I've had to combat OOM on GPU side is use Python's multiprocessing, but this was in a context where I was doing 10-fold cross-validation and the training would crash after 7 or 8 folds with OOM.
Alternatively, you could try turning eager execution on or off with
tf.config.run_functions_eagerly(False) # or True
Below code snippet is the custom training loop from Tensorflow official tutorial.https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch . Another tutorial also does not average loss over batch_size, as shown here https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
Why is the loss_value not averaged over batch_size at this line loss_value = loss_fn(y_batch_train, logits)? Is this a bug? From another question here Loss function works with reduce_mean but not reduce_sum, reduce_mean is indeed needed to average loss over batch_size
The loss_fn is defined in the tutorial as below. It obviously does not average over batch_size.
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True)
From documentation, keras.losses.SparseCategoricalCrossentropy sums loss over the batch without averaging. Thus, this is essentially reduce_sum instead of reduce_mean!
Type of tf.keras.losses.Reduction to apply to loss. Default value is AUTO. AUTO indicates that the reduction option will be determined by the usage context. For almost all cases this defaults to SUM_OVER_BATCH_SIZE.
The code is shown below.
epochs = 2
for epoch in range(epochs):
print("\nStart of epoch %d" % (epoch,))
# Iterate over the batches of the dataset.
for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
# Open a GradientTape to record the operations run
# during the forward pass, which enables auto-differentiation.
with tf.GradientTape() as tape:
# Run the forward pass of the layer.
# The operations that the layer applies
# to its inputs are going to be recorded
# on the GradientTape.
logits = model(x_batch_train, training=True) # Logits for this minibatch
# Compute the loss value for this minibatch.
loss_value = loss_fn(y_batch_train, logits)
# Use the gradient tape to automatically retrieve
# the gradients of the trainable variables with respect to the loss.
grads = tape.gradient(loss_value, model.trainable_weights)
# Run one step of gradient descent by updating
# the value of the variables to minimize the loss.
optimizer.apply_gradients(zip(grads, model.trainable_weights))
# Log every 200 batches.
if step % 200 == 0:
print(
"Training loss (for one batch) at step %d: %.4f"
% (step, float(loss_value))
)
print("Seen so far: %s samples" % ((step + 1) * 64))
I've figured it out, the loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=True) indeed averages loss over batch_size by default.
I am running a first test of a convolutional neural network with tensor flow. I adapted the recommended method with queue runners from the programming guide (see session definition below). Output is the last result from the cnn (here is only this last step given). label_batch_vector is the training label batch.
output = tf.matmul(h_pool2_flat, W_fc1) + b_fc1
label_batch_vector = tf.one_hot(label_batch, 33)
correct_prediction = tf.equal(tf.argmax(output, 1), tf.argmax(label_batch_vector, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
print_accuracy = tf.Print(accuracy, [accuracy])
# Create a session for running operations in the Graph.
sess = tf.Session()
# Initialize the variables (like the epoch counter).
sess.run(init_op)
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
try:
while not coord.should_stop():
# Run training steps or whatever
sess.run(train_step)
sess.run(print_accuracy)
except tf.errors.OutOfRangeError:
print('Done training -- epoch limit reached')
finally:
# When done, ask the threads to stop.
coord.request_stop()
# Wait for threads to finish.
coord.join(threads)
sess.close()
My problem is that accuracy is calculated for each batch and I would like it calculated for each epoch. I would need to do the following: initialize a epoch_accuracy tensor, for each of the calculated batch accuracies in the epoch add it to the epoch_accuracy. At the end of the epoch show the calculated training set accuracy. However I am not finding any such example with the this queue threads that I implemented (which is actually the recommended method from TensorFlow). Can anyone help ?
To compute accuracy on the stream of data (your sequence of batches, here), you can use the tf.metrics.accuracy function in tensorflow. See its doc here
You define the op like this
_, accuracy = tf.metrics.accuracy(y_true, y_pred)
Then you can update the accuracy in this way:
sess.run(accuracy)
PS: all functions in tf.metrics (auc, recall, etc.) support streaming
I know how to use an input pipeline to read data from files:
input = ... # Read from file
loss = network(input) # build a network
train_op = ... # Using SGD or other algorithms to train the network.
But how can I switch between multiple input pipelines? Say, if I want to train a network for 1000 batches on the training set from the training pipeline, then validate it on a validation set from another pipeline, then keep training, then validate, then train, ..., and so forth.
It's easy to implement this with feed_dict. I also know how to use checkpoints to achieve this, just like in the cifar-10 example. But it's kind of cumbersome: I need to dump the model to disk then read it from disk again.
Can I just switch between two input pipelines (one for training data, one for validation data) to achieve this? Reading 1000 batches from the training data queue, then a few batched from the validation data queue, and so forth. If it is possible, how to do it?
Not sure if this is exactly what you are looking for, but I am doing training and validation in the same code in two separate loops. My code reads numeric and string data from .CSV files, not images. I am reading from two separate CSV files, one for training and one for validation. I'm sure you can generalize it to read from two 'sets' of files, rather than just single files, as the code is there.
Here are the code snippets in case it helps. Note that this code first reads everything as string and then converts the necessary cells into floats, just given my own requirements. If your data is purely numeric, you should just set the defaults to floats and all should be easier. Also, there are a couple of lines in there that drop Weights and Biases into a CSV file AND serialize them into the TF checkpoint file, depending on which way you'd prefer.
#first define the defaults:
rDefaults = [['a'] for row in range((TD+TS+TL))]
# this function reads line-by-line from CSV and separates cells into chunks:
def read_from_csv(filename_queue):
reader = tf.TextLineReader(skip_header_lines=False)
_, csv_row = reader.read(filename_queue)
data = tf.decode_csv(csv_row, record_defaults=rDefaults)
dateLbl = tf.slice(data, [0], [TD])
features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32)
label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32)
return dateLbl, features, label
#this function loads the above lines and spits them out as batches of N:
def input_pipeline(fName, batch_size, num_epochs=None):
filename_queue = tf.train.string_input_producer(
[fName],
num_epochs=num_epochs,
shuffle=True)
dateLbl, features, label = read_from_csv(filename_queue)
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size # max of how much to load into memory
dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
[dateLbl, features, label],
batch_size=batch_size,
capacity=capacity,
min_after_dequeue=min_after_dequeue)
return dateLbl_batch, feature_batch, label_batch
# These are the TRAINING features, labels, and meta-data to be loaded from the train file:
dateLbl, features, labels = input_pipeline(fileNameTrain, batch_size, try_epochs)
# These are the TESTING features, labels, and meta-data to be loaded from the test file:
dateLblTest, featuresTest, labelsTest = input_pipeline(fileNameTest, batch_size, 1) # 1 epoch here regardless of training
# then you define the model, start the session, blah blah
# fire up the queue:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
#This is the TRAINING loop:
try:
while not coord.should_stop():
dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
_, acc, summary = sess.run([train_step, accuracyTrain, merged_summary_op], feed_dict={x: feature_batch, y_: label_batch,
keep_prob: dropout,
learning_rate: lRate})
except tf.errors.OutOfRangeError: # (so done reading the file(s))
# by the way, this dumps weights and biases into a CSV file, since you asked for that
np.savetxt(fPath + fIndex + '_weights.csv', sess.run(W),
# and this serializes weight and biases into the TF-formatted protobuf:
# tf.train.Saver({'varW': W, 'varB': b}).save(sess, fileNameCheck)
finally:
coord.request_stop()
# now re-start the runners for the testing file:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
try:
while not coord.should_stop():
# so now this line reads features, labels, and meta-data, but this time from the training file:
dateLbl_batch, feature_batch, label_batch = sess.run([dateLblTest, featuresTest, labelsTest])
guessY = tf.argmax(y, 1).eval({x: feature_batch, keep_prob: 1})
trueY = tf.argmax(label_batch, 1).eval()
accuracy = round(tf.reduce_mean(tf.cast(tf.equal(guessY, trueY), tf.float32)).eval(), 2)
except tf.errors.OutOfRangeError:
acCumTest /= i
finally:
coord.request_stop()
coord.join(threads)
This may differ from what you are trying to do in the sense that it first completes the Training loop and THEN restarts the queues for the Testing loop. Not sure how you'd do this if you want to go back and fourth, but you can try to experiment with the two functions defined above by passing them the relevant file names (or lists) interchangeably.
Also I'm not sure if re-starting the queues after training is the best way to go, but it works for me. Would love to see a better example out there, as most TF examples use some built-in wrappers around the MNIST dataset to do the training in one go...
The following snippet has been taken from the TensorFlow 0.12 API documentation
def input_pipeline(filenames, batch_size, num_epochs=None):
filename_queue = tf.train.string_input_producer(
filenames, num_epochs=num_epochs, shuffle=True)
example, label = read_my_file_format(filename_queue)
# min_after_dequeue defines how big a buffer we will randomly sample
# from -- bigger means better shuffling but slower start up and more
# memory used.
# capacity must be larger than min_after_dequeue and the amount larger
# determines the maximum we will prefetch. Recommendation:
# min_after_dequeue + (num_threads + a small safety margin) * batch_size
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * batch_size
example_batch, label_batch = tf.train.shuffle_batch(
[example, label], batch_size=batch_size, capacity=capacity,
min_after_dequeue=min_after_dequeue)
return example_batch, label_batch
The question I have might be very basic for a regular TensorFlow user, but I am an absolute beginner. The question is the following :
tf.train.string_input_producer creates a queue for holding the filenames. As the input_pipeline() is called over and over again during training, how will it be ensured that everytime the same queue is used ? I guess, it is important since, if different calls to input_pipeline() result in a creation of a new queue, there does not seem to be a way to ensure that different images are picked everytime and epoch counter and shuffling can be properly maintained.
The input_pipeline function only creates the part of a (usually larger) graph that is responsible for producing batches of data. If you were to call input_pipeline twice - for whatever reason - you would be creating two different queues indeed.
In general, the function tf.train.string_input_producer actually creates a queue node (or operation) in the currently active graph (which is the default graph unless you specify something different). read_my_file_format then reads from that queue and sequentially produces single "example" tensors, while tf.train.shuffle_batch then batches these into bundles of length batch_size each.
However, the output of tf.train.shuffle_batch, two Tensors here that are returned from the input_pipeline function, only really takes on a (new) value when it is evaluated under a session. If you evaluate these tensors multiple times, they will contain different values - taken, through read_my_file_format, from files listed in the input queue.
Think of it like so:
X_batch, Y_batch = input_pipeline(..., batch_size=100)
with tf.Session() as sess:
sess.run(tf.global_variable_initializer())
tf.train.start_queue_runners()
# get the first 100 examples and labels
X1, Y1 = sess.run((X_batch, Y_batch))
# get the next 100 examples and labels
X2, Y2 = sess.run((X_batch, Y_batch))
# etc.
The boilerplate code to get it running is a bit more complex, e.g. because queues need to actually be started and stopped in the graph, because they will throw a tf.errors.OutOfRangeError when they run dry, etc.
A more complete example could look like this:
with tf.Graph().as_default() as graph:
X_batch, Y_batch = input_pipeline(..., batch_size=100)
prediction = inference(X_batch)
optimizer, loss = optimize(prediction, Y_batch)
coord = tf.train.Coordinator()
with tf.Session(graph=graph) as sess:
init = tf.group(tf.local_variable_initializer(),
tf.global_variable_initializer())
sess.run(init)
# start the queue runners
threads = tf.train.start_queue_runners(coord=coord)
try:
while not coord.should_stop():
# now you're really indirectly querying the
# queue; each iteration will see a new batch of
# at most 100 values.
_, loss = sess.run((optimizer, loss))
# you might also want to do something with
# the network's output - again, this would
# use a fresh batch of inputs
some_predicted_values = sess.run(prediction)
except tf.errors.OutOfRangeError:
print('Training stopped, input queue is empty.')
finally:
coord.request_stop()
# stop the queue(s)
coord.request_stop()
coord.join(threads)
For a deeper understanding, you might want to look at the Reading data documentation.