The following snippet is taken from the TensorFlow 0.12 API documentation:
def input_pipeline(filenames, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # min_after_dequeue defines how big a buffer we will randomly sample
    #   from -- bigger means better shuffling but slower start up and more
    #   memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    #   determines the maximum we will prefetch. Recommendation:
    #   min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch
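The capacity recommendation in the comments above can be written out as a small helper. This is a minimal sketch in plain Python; the function name `recommended_capacity` and the default `safety_margin=2` are assumptions for illustration (chosen so that one thread reproduces the `3 * batch_size` used above) and are not part of the TensorFlow API.

```python
def recommended_capacity(min_after_dequeue, batch_size, num_threads=1, safety_margin=2):
    """Queue capacity following the rule of thumb quoted in the comments:
    min_after_dequeue + (num_threads + a small safety margin) * batch_size.
    """
    if min_after_dequeue < 0 or batch_size <= 0:
        raise ValueError("min_after_dequeue must be >= 0 and batch_size > 0")
    return min_after_dequeue + (num_threads + safety_margin) * batch_size


# With one reader thread this matches `min_after_dequeue + 3 * batch_size`.
capacity = recommended_capacity(min_after_dequeue=10000, batch_size=100)
```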
My question may be very basic for a regular TensorFlow user, but I am an absolute beginner:
tf.train.string_input_producer creates a queue for holding the filenames. Since input_pipeline() is called over and over again during training, how is it ensured that the same queue is used every time? I assume this matters: if each call to input_pipeline() created a new queue, there would be no way to ensure that different images are picked every time, or that the epoch counter and shuffling are properly maintained.
The input_pipeline function only creates the part of a (usually larger) graph that is responsible for producing batches of data. If you were to call input_pipeline twice - for whatever reason - you would be creating two different queues indeed.
In general, the function tf.train.string_input_producer actually creates a queue node (or operation) in the currently active graph (which is the default graph unless you specify something different). read_my_file_format then reads from that queue and sequentially produces single "example" tensors, while tf.train.shuffle_batch then batches these into bundles of length batch_size each.
However, the output of tf.train.shuffle_batch, two Tensors here that are returned from the input_pipeline function, only really takes on a (new) value when it is evaluated under a session. If you evaluate these tensors multiple times, they will contain different values - taken, through read_my_file_format, from files listed in the input queue.
Think of it like so:
X_batch, Y_batch = input_pipeline(..., batch_size=100)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    tf.train.start_queue_runners()

    # get the first 100 examples and labels
    X1, Y1 = sess.run((X_batch, Y_batch))

    # get the next 100 examples and labels
    X2, Y2 = sess.run((X_batch, Y_batch))

    # etc.
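As a plain-Python analogy (not TensorFlow code), the pipeline is built once and every evaluation pulls the next batch from the same underlying queue. The names `OutOfRange`, `build_pipeline`, and `next_batch` are hypothetical stand-ins for `tf.errors.OutOfRangeError`, `input_pipeline`, and `sess.run`:

```python
import itertools
import random


class OutOfRange(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


def build_pipeline(items, batch_size, num_epochs=None, seed=0):
    """Build the 'queue' once and return a function that gives a new batch per call."""
    rng = random.Random(seed)

    def stream():
        epochs = itertools.count() if num_epochs is None else range(num_epochs)
        for _ in epochs:
            shuffled = list(items)
            rng.shuffle(shuffled)  # reshuffle within each epoch
            yield from shuffled

    queue = stream()  # created exactly once, shared by all later calls

    def next_batch():
        batch = list(itertools.islice(queue, batch_size))
        if not batch:
            raise OutOfRange("input queue is empty")
        return batch

    return next_batch


next_batch = build_pipeline(["a", "b", "c", "d"], batch_size=2, num_epochs=1)
first = next_batch()   # two of the four items
second = next_batch()  # the remaining two items
```

Calling `build_pipeline` a second time would create a second, independent queue, which mirrors what happens when `input_pipeline` is called twice.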
The boilerplate code to get it running is a bit more complex, e.g. because queues need to actually be started and stopped in the graph, because they will throw a tf.errors.OutOfRangeError when they run dry, etc.
A more complete example could look like this:
with tf.Graph().as_default() as graph:
    X_batch, Y_batch = input_pipeline(..., batch_size=100)

    prediction = inference(X_batch)
    optimizer, loss = optimize(prediction, Y_batch)

coord = tf.train.Coordinator()
with tf.Session(graph=graph) as sess:
    init = tf.group(tf.local_variables_initializer(),
                    tf.global_variables_initializer())
    sess.run(init)

    # start the queue runners
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # now you're really indirectly querying the
            # queue; each iteration will see a new batch of
            # at most 100 values.
            # (use a separate name so the `loss` tensor is not overwritten)
            _, loss_value = sess.run((optimizer, loss))

            # you might also want to do something with
            # the network's output - again, this would
            # use a fresh batch of inputs
            some_predicted_values = sess.run(prediction)
    except tf.errors.OutOfRangeError:
        print('Training stopped, input queue is empty.')
    finally:
        # stop the queue(s)
        coord.request_stop()

    coord.join(threads)
For a deeper understanding, you might want to look at the Reading data documentation.
Related
I am not sure when this issue started (it must have been at some point in the last few months), but RAM (CPU memory) consumption seems to grow over time across epochs.
self.model.fit(
    train_data,
    initial_epoch=self.status.valid_last.epoch,
    epochs=train_config.epochs,
    steps_per_epoch=train_config.steps_per_epoch,
    callbacks=self._get_experiment_callbacks(),
    validation_data=valid_data,
    validation_steps=train_config.validation_steps,
)
The only thing out of the ordinary here might be the callbacks I am passing but there's actually nothing special here. One is a TensorBoard (TB) callback and the other is a custom Metric which is not doing much except plotting the learning rate and other general metrics to TB.
def _get_experiment_callbacks(self) -> List[tf.keras.callbacks.Callback]:
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        log_dir=os.path.join(out_dir, "logs"),
        update_freq="epoch",
        profile_batch=profile_batch,
        write_images=True,
    )
    # Not interested in whatever is plotted in those
    tensorboard_cb.on_epoch_end = lambda *args: ...
    tensorboard_cb.on_test_end = lambda *args: ...
    return [
        tensorboard_cb,
        Metrics(tensorboard_cb, update_freq=100),
    ]
This leaves us with the last suspect which is the valid_data itself. This is essentially just a list of protobuf files (shards) which I am loading like so:
def load_shards(
    decode_example_fn: Callable,
    shard_fps: List[str],
    training: bool,
    buffer_size: int = None  # 50 * 1000 ** 2,
) -> tf.data.Dataset:
    if not len(shard_fps) > 0:
        raise ValueError("Argument shard_fps must be a list to shards but is empty.")

    def make_dense_(example):
        for k, v in example.items():
            if isinstance(v, tf.SparseTensor):
                example[k] = tf.sparse.to_dense(v)
        return example

    def load_records_(filenames):
        record_dataset = tf.data.TFRecordDataset(filenames, buffer_size=buffer_size)
        record_dataset = record_dataset.map(decode_example_fn)
        record_dataset = record_dataset.map(make_dense_)
        return record_dataset

    if not training:
        shard_fps = sorted(shard_fps)

    dataset = tf.data.Dataset.from_tensor_slices(tf.constant(shard_fps))

    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)

    if training:
        dataset = dataset.interleave(load_records_, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    else:
        dataset = dataset.apply(load_records_)

    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset
From then on there are just preprocessing and transformation mappings on the inputs, so I would not expect a memory leak at this point.
Still, I am observing a continuous increase in memory consumption over time. The screenshot below shows the consumption after a restart.
At first we use ~28GB of RAM. After 100 steps there is a sharp increase to ~33GB, and from there it seems to stabilize at around 38GB. The next big jump, at 216k steps, comes from an evaluation. From there it just keeps growing.
It looks as if the memory usage stabilizes within an epoch and the jump only occurs after each epoch (1 epoch = 6000 steps).
There could be any number of things that could be wrong. TensorBoard could possibly not be reusing the same graph, but instead is adding graphs, which leads to OOM. I don't use TensorBoard myself because I remember this as happening to me a few years back. It's also possible that using model.fit is the problem and that you're loading your data at every epoch. You could try writing the training loop something like:
for epoch in tf.range(epochs):
    batch_train_loss = []
    batch_train_acc = []
    for batch, (X, Y) in train_dataset.enumerate():
        train_loss = train_fn(X, Y, model, loss, optimizer, metric, batch)  # do the actual training
        train_acc = metric.result().numpy()  # get the training accuracy
        batch_train_loss.append(train_loss)  # save the training loss above
        batch_train_acc.append(train_acc)  # save the training accuracy above
        metric.reset_states()  # reset the metric after every batch
where the train_fn is:
def get_apply_train_fn():
    @tf.function
    def train_function(X, Y, model, loss, optimizer, metric, step):
        with tf.GradientTape() as tape:
            predictions = model(X, training=True)
            loss_value = loss(Y, predictions)
        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        metric.update_state(Y, predictions)
        return loss_value
    return train_function

train_fn = get_apply_train_fn()
Now, this is a stupidly complicated way of writing model.fit, but it does work.
Another way I've had to combat OOM on the GPU side is to use Python's multiprocessing, but this was in a context where I was doing 10-fold cross-validation and the training would crash with OOM after 7 or 8 folds.
Alternatively, you could try turning eager execution on or off with
tf.config.run_functions_eagerly(False) # or True
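To narrow down where the growth comes from, one option is to record memory at epoch boundaries. The following is a minimal sketch using the standard-library `tracemalloc` module; the helper name `traced_memory_per_epoch` is hypothetical. Note the assumption: `tracemalloc` only sees allocations made through Python's allocator, so memory held by TensorFlow's native runtime is generally not visible here. It can only help rule the Python side (for example, lists that keep growing in a callback) in or out.

```python
import tracemalloc


def traced_memory_per_epoch(run_epoch, num_epochs):
    """Call `run_epoch(epoch)` repeatedly and record traced Python memory (bytes) after each call."""
    tracemalloc.start()
    try:
        usage = []
        for epoch in range(num_epochs):
            run_epoch(epoch)
            current, _peak = tracemalloc.get_traced_memory()
            usage.append(current)
        return usage
    finally:
        tracemalloc.stop()


# Toy example: an "epoch" that keeps a reference to 1 MB of data each time it runs.
leaked = []
usage = traced_memory_per_epoch(lambda epoch: leaked.append(bytearray(1_000_000)), num_epochs=4)
growth = usage[-1] - usage[0]  # memory retained between the first and the last epoch
```

If the recorded values stay flat while the process's RAM keeps rising, the growth is more likely on the native side (dataset buffers, graph retracing) than in Python objects.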
We have the following input pipeline:
with tf.name_scope('input'):
    filename_queue = tf.train.string_input_producer(
        [filename], num_epochs=num_epochs)

    # Even when reading in multiple threads, share the filename
    # queue.
    image, label = read_and_decode(filename_queue)

    # Shuffle the examples and collect them into batch_size batches.
    # (Internally uses a RandomShuffleQueue.)
    # We run this in two threads to avoid being a bottleneck.
    images, sparse_labels = tf.train.shuffle_batch(
        [image, label], batch_size=batch_size, num_threads=2,
        capacity=1000 + 3 * batch_size,
        # Ensures a minimum amount of shuffling of examples.
        min_after_dequeue=1000)

    return images, sparse_labels
and we have the following training:
# Start input enqueue threads.
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)

try:
    step = 0
    while not coord.should_stop():
        start_time = time.time()

        # Run one step of the model. The return values are
        # the activations from the `train_op` (which is
        # discarded) and the `loss` op. To inspect the values
        # of your ops or variables, you may include them in
        # the list passed to sess.run() and the value tensors
        # will be returned in the tuple from the call.
        _, loss_value = sess.run([train_op, loss])

        duration = time.time() - start_time

        # Print an overview fairly often.
        if step % 100 == 0:
            print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value,
                                                       duration))
        step += 1
except tf.errors.OutOfRangeError:
    print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
finally:
    # When done, ask the threads to stop.
    coord.request_stop()

# Wait for threads to finish.
coord.join(threads)
sess.close()
I have two questions:
1) Does the variable num_epochs decide the number of training iterations?
2) My model is pretty large and I want to checkpoint, restore, and continue training.
How do I know, for a restored model, how many iterations have been done and how many are left?
1) As stated in the TensorFlow API documentation, tf.train.string_input_producer raises a tf.errors.OutOfRangeError once each string has been produced num_epochs times. So yes, num_epochs decides the number of training iterations in your code.
2) I think it should be possible to declare a tf.Variable and increment it for each epoch that you run, so that when you restore your model you can read that value again and train for the remaining epochs. Unfortunately I don't know if there is a smarter way, since most people only save their models for prediction after training, or fine-tune for a fixed number of epochs.
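The counter idea in 2) can be sketched in plain Python. This is only an illustration under stated assumptions: a small JSON file stands in for a non-trainable tf.Variable saved with the model checkpoint, and the names `save_progress`, `load_progress`, and `train` are hypothetical.

```python
import json
import os
import tempfile


def save_progress(path, epochs_done):
    """Persist the epoch counter (JSON stand-in for a variable stored in the checkpoint)."""
    with open(path, "w") as f:
        json.dump({"epochs_done": epochs_done}, f)


def load_progress(path):
    """Return the saved epoch counter, or 0 when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["epochs_done"]


def train(path, total_epochs, stop_after=None):
    """Resume from the saved counter and train only the remaining epochs."""
    start = load_progress(path)
    for epoch in range(start, total_epochs):
        if stop_after is not None and epoch - start >= stop_after:
            break  # simulate an interruption
        # ... one epoch of training would run here ...
        save_progress(path, epoch + 1)
    return load_progress(path)


progress_file = os.path.join(tempfile.mkdtemp(), "progress.json")
done_first_run = train(progress_file, total_epochs=10, stop_after=3)  # interrupted run
remaining = 10 - load_progress(progress_file)                         # epochs still to do
done_second_run = train(progress_file, total_epochs=10)               # resumes where it stopped
```

With a queue-based pipeline, the remaining count would then be what gets passed as num_epochs when the input producer is rebuilt after restoring.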
Hope I could help.
In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
    with tf.Graph().as_default() as g:
        # Get images and labels for CIFAR-10.
        eval_data = FLAGS.eval_data == 'test'
        images, labels = cifar10.inputs(eval_data=eval_data)

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate predictions.
        top_k_op = tf.nn.in_top_k(logits, labels, 1)

        # Restore the moving average version of the learned variables for eval.
        variable_averages = tf.train.ExponentialMovingAverage(
            cifar10.MOVING_AVERAGE_DECAY)
        variables_to_restore = variable_averages.variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)

        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.summary.merge_all()

        summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)

        while True:
            eval_once(saver, summary_writer, top_k_op, summary_op)
            if FLAGS.run_once:
                break
            time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints the 'precision' to the console every eval_interval_secs. My questions are:
1) Each time eval_once() is executed, one batch of samples (128) is dequeued from the data queue. Why doesn't the evaluation stop after enough batches, i.e. 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
2) Are the first 79 batches mutually exclusive? I'd assume so but want to double-check.
3) If each batch is indeed dequeued from the data queue, what are the samples after the first 79 batches? Are they sampled again from a repeated copy of the whole data queue?
4) Since in_top_k() takes unnormalized logit values and outputs a vector of booleans, it hides the internal softmax() + thresholding. Is there a TF op for doing that computation explicitly? Ideally, it would be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in the inputs function of cifar10_input.py:
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer :
string_input_producer(
    string_tensor,
    num_epochs=None,
    shuffle=True,
    seed=None,
    capacity=32,
    shared_name=None,
    name=None,
    cancel_op=None
)
num_epochs : produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
1) In our case, num_epochs is not specified. That is why evaluation does not stop after a few batches: the queue can cycle through the data an unlimited number of times.
2) By default, the shuffle option is set to True in tf.train.string_input_producer, so it shuffles the filenames within each epoch and then cycles through them again and again. Therefore, the batches within one pass over the data are mutually exclusive. You can print the filenames to see this.
3) As explained in 1), they are repeated samples (not random data).
4) You could avoid using tf.nn.in_top_k. Use tf.nn.softmax and tf.greater_equal to obtain a boolean tensor marking the classes whose softmax value is at or above a specific threshold.
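To make the softmax + threshold computation concrete, here is a minimal plain-Python sketch of what tf.nn.softmax followed by tf.greater_equal would compute for a single example. The function names `softmax` and `above_threshold` and the sample logits are illustrative assumptions, not TensorFlow code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of unnormalized logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def above_threshold(logits, threshold):
    """Boolean mask: which classes have softmax probability >= threshold."""
    return [p >= threshold for p in softmax(logits)]


probs = softmax([2.0, 1.0, 0.1])
mask = above_threshold([2.0, 1.0, 0.1], threshold=0.5)  # only the first class passes
```

Unlike in_top_k with k=1, which always accepts the highest-scoring class, a threshold can reject every class when the model is not confident enough.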
I hope this helps. Please comment if there is any misunderstanding.
Using TF 0.12.1, we are trying to understand how TensorFlow's performance breaks down. In particular, we are looking at the Inception-v3 model and how long the forward pass takes.
As a first step, we ran a benchmark on just the inference step. To avoid queueing time, we set the training example to a constant tensor and run it through the Inception model. The train method is below:
def train(dataset):
    """Train on dataset for a number of steps."""
    with tf.Graph().as_default(), tf.device('/cpu:0'):
        # Create a variable to count the number of train() calls. This equals the
        # number of batches processed * FLAGS.num_gpus.
        global_step = tf.get_variable(
            'global_step', [],
            initializer=tf.constant_initializer(0), trainable=False)

        # Calculate the learning rate schedule.
        num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                                 FLAGS.batch_size)
        decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

        # Decay the learning rate exponentially based on the number of steps.
        lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                        global_step,
                                        decay_steps,
                                        FLAGS.learning_rate_decay_factor,
                                        staircase=True)

        # Create an optimizer that performs gradient descent.
        opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                        momentum=RMSPROP_MOMENTUM,
                                        epsilon=RMSPROP_EPSILON)

        # Get images and labels for ImageNet and split the batch across GPUs.
        assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
            'Batch size must be divisible by number of GPUs')
        split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)

        num_classes = dataset.num_classes() + 1

        # Calculate the gradients for each model tower.
        tower_grads = []
        reuse_variables = None
        for i in xrange(FLAGS.num_gpus):
            with tf.device('/gpu:%d' % i):
                with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
                    # Force all Variables to reside on the CPU.
                    with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
                        # Calculate the loss for one tower of the ImageNet model. This
                        # function constructs the entire ImageNet model but shares the
                        # variables across all towers.
                        image_shape = (FLAGS.batch_size, FLAGS.image_size, FLAGS.image_size, 3)
                        labels_shape = (FLAGS.batch_size)
                        images = tf.zeros(image_shape, dtype=tf.float32)
                        labels = tf.zeros(labels_shape, dtype=tf.int32)
                        logits = _tower_loss(images, labels, num_classes,
                                             scope, reuse_variables)

                    # Reuse variables for the next tower.
                    reuse_variables = True

        # Build an initialization operation to run below.
        init = tf.initialize_all_variables()

        # Start running operations on the Graph. allow_soft_placement must be set to
        # True to build towers on GPU, as some of the ops do not have GPU
        # implementations.
        sess = tf.Session(config=tf.ConfigProto(
            allow_soft_placement=True,
            log_device_placement=FLAGS.log_device_placement))
        sess.run(init)

        # Start the queue runners.
        tf.train.start_queue_runners(sess=sess)

        for step in xrange(FLAGS.max_steps):
            start_time = time.time()
            loss_value = sess.run(logits)
            duration = time.time() - start_time

            examples_per_sec = FLAGS.batch_size / float(duration)
            format_str = ('%s: step %d, loss =(%.1f examples/sec; %.3f '
                          'sec/batch)')
            print(format_str % (datetime.now(), step,
                                examples_per_sec, duration))
For 8 GPUs, a batch size of 32, and 1 param server, we observe 0.44 seconds per logits operation which does the forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see figure below). For the GPU runtime, observe that there is an initial burst followed by a break, followed by a longer GPU burst. We assume the initial burst is the forward pass while the second burst is the backpropagation.
If the initial burst really is the forward pass time, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is it a mistake with the benchmarking app or is the timeline tool not capturing the full picture? Additionally, there are a couple of GPU operations before the first large burst that we cannot really explain. Any insight into this would be very much appreciated!
TensorFlow has undergone a number of significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow, or version 1.2 when it is released.
If you would like to work from a high-performance model as a starting point, I strongly recommend working from https://github.com/tensorflow/benchmarks which include an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend instrumenting the C++ TensorFlow runtime. (The overhead from within Python can be significant, and could introduce uncertainty in your measurements.)
Additionally, it's important to run the experiment a number of iterations to allow the system to "warm up" and fully initialize.
One thing to note: if you are trying to tune your model, be sure to avoid setting allow_soft_placement=True. For now, it's better to ensure that all operations you expect are truly placed on the GPUs. You can confirm by looking at the log output controlled by the log_device_placement parameter.
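The warm-up advice above can be sketched as a small timing helper. This is a plain-Python illustration, not part of TensorFlow: the name `benchmark` and its parameters are assumptions, and `step_fn` stands for whatever is being measured (in the question, that would be a callable wrapping `sess.run(logits)`).

```python
import time


def benchmark(step_fn, warmup_steps=5, timed_steps=20):
    """Run `step_fn` a few times untimed, then return the mean seconds per timed step."""
    for _ in range(warmup_steps):
        step_fn()  # warm-up: results and timings are discarded
    durations = []
    for _ in range(timed_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Toy workload standing in for one forward pass.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(1000))), warmup_steps=3, timed_steps=10)
```

Excluding the first few steps keeps one-time costs, such as initialization on the first call, from inflating the per-step average.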
I know how to use an input pipeline to read data from files:
input = ... # Read from file
loss = network(input) # build a network
train_op = ... # Using SGD or other algorithms to train the network.
But how can I switch between multiple input pipelines? Say, if I want to train a network for 1000 batches on the training set from the training pipeline, then validate it on a validation set from another pipeline, then keep training, then validate, then train, ..., and so forth.
It's easy to implement this with feed_dict. I also know how to use checkpoints to achieve this, just like in the cifar-10 example. But it's kind of cumbersome: I need to dump the model to disk then read it from disk again.
Can I just switch between two input pipelines (one for training data, one for validation data) to achieve this? That is, read 1000 batches from the training data queue, then a few batches from the validation data queue, and so forth. If so, how?
Not sure if this is exactly what you are looking for, but I am doing training and validation in the same code in two separate loops. My code reads numeric and string data from .CSV files, not images. I am reading from two separate CSV files, one for training and one for validation. I'm sure you can generalize it to read from two 'sets' of files, rather than just single files, as the code is there.
Here are the code snippets in case it helps. Note that this code first reads everything as string and then converts the necessary cells into floats, just given my own requirements. If your data is purely numeric, you should just set the defaults to floats and all should be easier. Also, there are a couple of lines in there that drop Weights and Biases into a CSV file AND serialize them into the TF checkpoint file, depending on which way you'd prefer.
# first define the defaults:
rDefaults = [['a'] for row in range((TD+TS+TL))]

# this function reads line-by-line from CSV and separates cells into chunks:
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=False)
    _, csv_row = reader.read(filename_queue)
    data = tf.decode_csv(csv_row, record_defaults=rDefaults)
    dateLbl = tf.slice(data, [0], [TD])
    features = tf.string_to_number(tf.slice(data, [TD], [TS]), tf.float32)
    label = tf.string_to_number(tf.slice(data, [TD+TS], [TL]), tf.float32)
    return dateLbl, features, label

# this function loads the above lines and spits them out as batches of N:
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size  # max of how much to load into memory
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch
# These are the TRAINING features, labels, and meta-data to be loaded from the train file:
dateLbl, features, labels = input_pipeline(fileNameTrain, batch_size, try_epochs)
# These are the TESTING features, labels, and meta-data to be loaded from the test file:
dateLblTest, featuresTest, labelsTest = input_pipeline(fileNameTest, batch_size, 1) # 1 epoch here regardless of training
# then you define the model, start the session, blah blah
# fire up the queue:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
# This is the TRAINING loop:
try:
    while not coord.should_stop():
        dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])
        _, acc, summary = sess.run([train_step, accuracyTrain, merged_summary_op], feed_dict={x: feature_batch, y_: label_batch,
            keep_prob: dropout,
            learning_rate: lRate})
except tf.errors.OutOfRangeError:  # (so done reading the file(s))
    # by the way, this dumps weights and biases into a CSV file, since you asked for that
    np.savetxt(fPath + fIndex + '_weights.csv', sess.run(W),
    # and this serializes weight and biases into the TF-formatted protobuf:
    # tf.train.Saver({'varW': W, 'varB': b}).save(sess, fileNameCheck)
finally:
    coord.request_stop()

# now re-start the runners for the testing file:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
try:
    while not coord.should_stop():
        # so now this line reads features, labels, and meta-data, but this time from the testing file:
        dateLbl_batch, feature_batch, label_batch = sess.run([dateLblTest, featuresTest, labelsTest])
        guessY = tf.argmax(y, 1).eval({x: feature_batch, keep_prob: 1})
        trueY = tf.argmax(label_batch, 1).eval()
        accuracy = round(tf.reduce_mean(tf.cast(tf.equal(guessY, trueY), tf.float32)).eval(), 2)
except tf.errors.OutOfRangeError:
    acCumTest /= i
finally:
    coord.request_stop()

coord.join(threads)
This may differ from what you are trying to do in the sense that it first completes the Training loop and THEN restarts the queues for the Testing loop. Not sure how you'd do this if you want to go back and forth, but you can try to experiment with the two functions defined above by passing them the relevant file names (or lists) interchangeably.
Also I'm not sure if re-starting the queues after training is the best way to go, but it works for me. Would love to see a better example out there, as most TF examples use some built-in wrappers around the MNIST dataset to do the training in one go...
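For the back-and-forth case, the scheduling itself can be sketched in plain Python, independent of TensorFlow. This is only an illustration under stated assumptions: the name `alternate` is hypothetical, and the two iterables stand in for whatever produces training and validation batches (for example, repeated `sess.run` calls on the two sets of batch tensors defined above, with the results fed through feed_dict).

```python
import itertools


def alternate(train_batches, val_batches, train_steps, val_steps):
    """Yield ('train', batch) for `train_steps` batches, then ('val', batch) for
    `val_steps` batches, and repeat until the training iterator is exhausted."""
    train_iter = iter(train_batches)
    val_iter = itertools.cycle(val_batches)  # validation data is reused every round
    while True:
        chunk = list(itertools.islice(train_iter, train_steps))
        if not chunk:
            return
        for batch in chunk:
            yield "train", batch
        for batch in itertools.islice(val_iter, val_steps):
            yield "val", batch


schedule = list(alternate(range(4), ["v0", "v1"], train_steps=2, val_steps=1))
```

In the loop that consumes this schedule, the 'train' entries would run the training op and the 'val' entries would only compute accuracy, so no checkpoint round-trip is needed between the two phases.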