Breaking down Tensorflow performance with timeline and benchmarking - tensorflow

Using TF 0.12.1, we are trying to understand how the performance of TensorFlow breaks down. In particular, we are looking at the Inception-v3 model and how long the forward pass takes.
The first step was to benchmark just the inference step. To avoid queueing time, we set the training example to a constant tensor and run it through the Inception model. The train method in the code is below:
def train(dataset):
  """Train on dataset for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)
    # Calculate the learning rate schedule.
    num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                             FLAGS.batch_size)
    decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)
    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                    global_step,
                                    decay_steps,
                                    FLAGS.learning_rate_decay_factor,
                                    staircase=True)
    # Create an optimizer that performs gradient descent.
    opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                    momentum=RMSPROP_MOMENTUM,
                                    epsilon=RMSPROP_EPSILON)
    # Get images and labels for ImageNet and split the batch across GPUs.
    assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
        'Batch size must be divisible by number of GPUs')
    split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)
    num_classes = dataset.num_classes() + 1
    # Calculate the gradients for each model tower.
    tower_grads = []
    reuse_variables = None
    for i in xrange(FLAGS.num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
          # Force all Variables to reside on the CPU.
          with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
            # Calculate the loss for one tower of the ImageNet model. This
            # function constructs the entire ImageNet model but shares the
            # variables across all towers.
            image_shape = (FLAGS.batch_size, FLAGS.image_size, FLAGS.image_size, 3)
            labels_shape = (FLAGS.batch_size)
            images = tf.zeros(image_shape, dtype=tf.float32)
            labels = tf.zeros(labels_shape, dtype=tf.int32)
            logits = _tower_loss(images, labels, num_classes,
                                 scope, reuse_variables)
          # Reuse variables for the next tower.
          reuse_variables = True
    # Build an initialization operation to run below.
    init = tf.initialize_all_variables()
    # Start running operations on the Graph. allow_soft_placement must be set to
    # True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)
    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)
    for step in xrange(FLAGS.max_steps):
      start_time = time.time()
      loss_value = sess.run(logits)
      duration = time.time() - start_time
      examples_per_sec = FLAGS.batch_size / float(duration)
      format_str = ('%s: step %d, loss =(%.1f examples/sec; %.3f '
                    'sec/batch)')
      print(format_str % (datetime.now(), step,
                          examples_per_sec, duration))
For 8 GPUs, a batch size of 32, and 1 param server, we observe 0.44 seconds per logits operation which does the forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see figure below). For the GPU runtime, observe that there is an initial burst followed by a break, followed by a longer GPU burst. We assume the initial burst is the forward pass while the second burst is the backpropagation.
If the initial burst really is the forward pass time, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is it a mistake with the benchmarking app or is the timeline tool not capturing the full picture? Additionally, there are a couple of GPU operations before the first large burst that we cannot really explain. Any insight into this would be very much appreciated!

TensorFlow has undergone a number of significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow, or version 1.2 when it is released.
If you would like to work from a high-performance model as a starting point, I strongly recommend working from https://github.com/tensorflow/benchmarks, which includes an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend instrumenting the C++ TensorFlow runtime. (The overhead from within Python can be significant, and could introduce uncertainty in your measurements.)
Additionally, it's important to run the experiment for a number of iterations to allow the system to "warm up" and fully initialize before you start measuring.
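To make the warm-up advice concrete, here is a minimal, framework-free sketch of a timing loop that runs a few untimed iterations before measuring. The names benchmark, warmup_steps, timed_steps and fake_step are made up for illustration; fake_step only stands in for a call such as sess.run(logits).

```python
import time


def benchmark(step_fn, warmup_steps=5, timed_steps=20):
    """Return the mean wall-clock seconds per call of step_fn.

    The first warmup_steps calls run but are not timed, so one-time
    costs (initialization, caching) do not skew the mean.
    """
    for _ in range(warmup_steps):
        step_fn()
    durations = []
    for _ in range(timed_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Hypothetical stand-in for the real step, e.g. sess.run(logits).
calls = []


def fake_step():
    calls.append(1)


mean_seconds = benchmark(fake_step, warmup_steps=5, timed_steps=20)
print('%.6f sec/step averaged over 20 timed steps' % mean_seconds)
```

The number of warm-up steps is a guess; in practice you would increase it until the per-step time stops drifting.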
One thing to note: if you are trying to tune your model, be sure to avoid setting allow_soft_placement=True. For now, it's better to ensure that all operations you expect are truly placed on the GPUs. You can confirm by looking at the log output controlled by the log_device_placement parameter.

Related

Running training the discriminator with more examples

As I understand it, one of the differences between a regular GAN and a WGAN is that we train the discriminator/critic with more examples in each iteration. If in the regular GAN we have one batch for both modules in each iteration, in WGAN we will have 5 batches (or more) for the discriminator and one for the generator.
So basically we have another inner loop for the discriminator:
real_images_labels = np.ones((BATCH_SIZE, 1))
fake_images_labels = -real_images_labels
for epoch in range(epochs):
    for batch in range(NUM_BACHES):
        for critic_iter in range(n_critic):
            random_batches_idx = np.random.randint(0, NUM_BACHES)  # Choose random batch from dataset
            imgs_data = dataset_list[random_batches_idx]
            c_loss_real = critic.train_on_batch(imgs_data, real_images_labels)  # update the weights after 1 batch
            noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
            generated_images = generator(noise, training=True)
            c_loss_fake = critic.train_on_batch(generated_images, fake_images_labels)  # update the weights after 1 batch
        imgs_data = dataset_list[batch]
        noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
        gen_loss_batch = gen_loss_batch + gan.train_on_batch(noise, real_images_labels)
Training takes a long time, about 3 minutes per epoch. To reduce it, instead of running a forward pass n_critic times for each batch, I could increase the batch_size for the discriminator and run a single forward pass with the larger batch.
I am seeking feedback: does this sound reasonable?
(I didn't paste my entire code, it was just a part of it).
Yes, it does sound reasonable. Increasing batch_size during training typically decreases the training time, at the cost of using more memory and lower accuracy (lower generalization ability).
Having said this, you should always use trial and error with regard to batching, as extreme values may or may not reduce the training time.
For further discussion you can refer to this question

Basic TPU Cross Shard Optimizer Not Working

In general, there are some good examples that use TF optimizers for solving general (non deep learning) problems. Given:
https://databricks.com/tensorflow/training-and-convergence
https://colab.research.google.com/notebooks/tpu.ipynb#scrollTo=a_rjVo-RAoYd
We want to be able to combine the two above and make use of TPU based optimization in solving high dimensional problems.
To that end I've got a simple colab code that does this merging the two examples above:
import tensorflow as tf
import numpy as np
from tensorflow.contrib.tpu.python.tpu import tpu_function
import os
import pprint
if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)
    with tf.Session(tpu_address) as session:
        devices = session.list_devices()
    print('TPU devices:')
    pprint.pprint(devices)
# Add this somewhere at the top
tpu_function.get_tpu_context().set_number_of_shards(8)
# x and y are placeholders for our training data
x = tf.placeholder("float")
y = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
# Our model of y = a*x + b
y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
# Our error is defined as the square of the differences
error = tf.square(y - y_model)
# The Gradient Descent Optimizer does the heavy lifting
train_op = tf.train.AdamOptimizer(0.01)
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)  # TPU change 1
# Normal TensorFlow - initialize values, create a session and run the model
model = tf.global_variables_initializer()
with tf.Session(tpu_address) as session:
    session.run(tf.contrib.tpu.initialize_system())
    print('init')
    session.run(model)
    for i in range(10000):
        print(i)
        x_value = np.random.rand()
        y_value = x_value * 2 + 6 + 5 + 3
        session.run(optimizer, feed_dict={x: x_value, y: y_value})
    w_value = session.run(w)
    print("Predicted model: {a:.3f}x + {b:.3f}+{c:.3f}x + {d:.3f}".format(a=w_value[0], b=w_value[1], c=w_value[2], d=w_value[3]))
    session.run(tf.contrib.tpu.shutdown_system())
When I run it (in colab) as it is it just runs the first loop printing:
init
0
and then does nothing, and Colab just keeps spinning.
If I do not use
optimizer = tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
And other TPU features, then it works fine estimating the w Variable.
The questions are:
Why doesn't this work and how can we get the cross shard replicator to optimise this simple function?
How should I shape the variable w to make use of parallel batches/shards on the TPU?
How can we make this even more efficient through use of an equivalent Dataset prefetch operation or using infeed queues?
The goal is to make use of the lower-level TPU APIs without TPUEstimator, for example to help solve custom problems by leveraging the power of TPUs using only tensors, queues and shards.
It doesn't work because you are overriding the number of shards without actually splitting the calculations into shards. When I run your code, I get the following error:
InternalError: From /job:tpu_worker/replica:0/task:0:
RET_CHECK failure (platforms/xla/service/jellyfish/lowering/all_reduce_emitter.cc:832) replica_id < target.ReplicaCount() Unexpected replica id in all-reduce, replica_id is 1, target has 1 replicas.
Error encountered while compiling %all-reduce.7 = f32[4]{0:T(256)} all-reduce(f32[4]{0:T(256)} %arg0.1), replica_groups={{0,1,2,3,4,5,6,7}}, to_apply=%sum.3, metadata={op_type="CrossReplicaSum" op_name="CrossReplicaSum_21"}, backend_config="{barrier_type:3}".
It is trying to perform the computations on eight shards and combine the results, but it only has one shard to work with. Take a look at tf.contrib.tpu.shard. It creates a shard context using the given number of shards and distributes a computation over those shards. So, instead of setting the number of shards manually, you can define your variables as usual and then wrap any computations with them in a function to be sharded:
# REMOVE THIS
# tpu_function.get_tpu_context().set_number_of_shards(8)
# x and y are placeholders for our training data
x_placeholder = tf.placeholder("float")
y_placeholder = tf.placeholder("float")
# w is the variable storing our values. It is initialised with starting "guesses"
# w[0] is the "a" in our equation, w[1] is the "b"
w = tf.Variable([1.0, 2.0, 3.0, 4.0], name="w")
# Wrap all of our tensorflow operations in a function we can shard
def calculations(x, y):
    # Our model of y = a*x + b
    y_model = tf.multiply(x, w[0]) + w[1] + w[2] + 3
    # Our error is defined as the square of the differences
    # Average across the entire batch
    error = tf.reduce_mean(tf.square(y - y_model))
    # The Gradient Descent Optimizer does the heavy lifting
    train_op = tf.train.AdamOptimizer(0.01)
    return tf.contrib.tpu.CrossShardOptimizer(train_op).minimize(error)
# Shard the function so that its calculation is distributed
optimizer = tf.contrib.tpu.shard(calculations, inputs=[x_placeholder, y_placeholder], num_shards=8)
You don't need to shape w to make use of shards, because sharding occurs across the batch dimension and you only have one set of weights for all inputs. You'll want to add a batch dimension to your inputs so that each batch can be distributed across the cores. shard assumes the first dimension is the batch dimension, but includes an argument to change it if your data is shaped differently. According to the TPU troubleshooting page, the ideal batch size is 1024 so that there are 128 samples per TPU core. If that is too big for your model, you can go smaller as long as it is a multiple of 128. Check out the above link and the performance guide for more tips on increasing performance.
for i in range(1000):
    print(i)
    x_value = np.random.rand(1024)  # Generate a batch of 1024 values
    y_value = x_value * 2 + 6 + 5 + 3
    session.run(optimizer, feed_dict={x_placeholder: x_value, y_placeholder: y_value})
Everything else should remain the same. I was able to train the model for all 10000 iterations. Keep in mind that for this simple model it will probably be slower than using CPU/GPU, but you should expect performance improvements for more complex problems with larger datasets.
I'm not familiar enough with Datasets or infeed queues to comment on this, but shard includes an argument for infeed queues so it likely has support for them. You might have to play around with it to see how it gets data to the computation function.
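To illustrate why w needs no special shape, here is a small plain-Python sketch (no TPU or TensorFlow involved) of what sharding across the batch dimension amounts to. It assumes the cross-shard step averages the per-shard gradients, and it uses a simplified two-parameter model y = a*x + b rather than the four-element w from the question. The helper mse_gradient and the constants are made up for illustration.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean((y - (w[0]*x + w[1]))**2) with respect to w."""
    n = len(xs)
    grad_a = sum(-2.0 * (y - (w[0] * x + w[1])) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(-2.0 * (y - (w[0] * x + w[1])) for x, y in zip(xs, ys)) / n
    return [grad_a, grad_b]


NUM_SHARDS = 8
BATCH_SIZE = 1024

w = [1.0, 2.0]  # one shared set of weights for every shard
xs = [i / BATCH_SIZE for i in range(BATCH_SIZE)]
ys = [2.0 * x + 14.0 for x in xs]

# Split only the batch dimension: each shard sees 1024 / 8 = 128 samples.
shard_size = BATCH_SIZE // NUM_SHARDS
shard_grads = []
for s in range(NUM_SHARDS):
    lo, hi = s * shard_size, (s + 1) * shard_size
    shard_grads.append(mse_gradient(w, xs[lo:hi], ys[lo:hi]))

# Assumed cross-shard step: average the per-shard gradients.
combined = [sum(g[k] for g in shard_grads) / NUM_SHARDS for k in range(2)]
full = mse_gradient(w, xs, ys)
print(shard_size, combined, full)
```

With equal-sized shards and a mean loss, the averaged per-shard gradient matches the full-batch gradient up to floating-point rounding, which is why the weights themselves are not split.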

How to feed the list of gradients, or (grad, variable name) pairs, to my model

This is related to a previous question: How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; but, still I couldn't find the exact answer. For example, the answer to another related question tensorflow - run optimizer op on a large batch doesn't work for me (btw. it wasn't accepted and there are no more comments there).
I want to try to simulate larger batch size but using only one GPU.
So, I need to compute the gradients for every smaller batch, aggregate/average them across several such smaller batches, and only then apply.
(Basically, it's like synchronized distributed SGD, but performed serially on a single device/GPU. Of course, the speed advantage of distributed SGD is lost, but the larger batch size itself may enable convergence to higher accuracy and allow a larger step size, as indicated by a few recent papers.)
To keep memory requirement low, I should do standard SGD with small batches, update the gradients after every iteration and then call optimizer.apply_gradients() (where optimizer is one of the implemented optimizers).
So, everything looks simple but when I go to implement it, it is actually not so trivial.
For example, I would like to use one Graph, compute gradients for each iteration, and then, when multiple batches are processed, sum the gradients and pass them to my model. But the list itself can't be fed into the feed_dict parameter of sess.run. Also, passing gradients directly doesn't exactly work, I get the TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I can't pass in the numpy.ndarray, only tensorflow variable).
I could define a placeholder for the gradients, but for that I would need to build the model first (to specify the trainable variables etc.).
All in all, please tell me there is a simpler way to implement this.
There is no simpler way than what you have already been told. That way may seem complicated at first, but it actually is really simple. You just have to use the low-level API to manually calculate the gradients for each batch, average over them, and then manually feed the averaged gradients to the optimizer to apply them.
I'll try to provide some stripped down code of how to do this. I'll use dots as placeholders for actual code which would depend on the problem. What you would usually do would be something like this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# define operation to apply the gradients
minimize = optimizer.minimize(loss)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())
    for step in range(1, MAX_STEPS + 1):
        data = ...
        # use a separate name so the loss tensor is not overwritten
        loss_value = session.run([minimize, loss],
                                 feed_dict={input: data})[1]
What you want to do instead, to average over multiple batches and preserve memory, would be this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# grab all trainable variables
trainable_variables = tf.trainable_variables()
# define variables to save the gradients in each batch
accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
                                     trainable=False) for tv in
                         trainable_variables]
# define operation to reset the accumulated gradients to zero
reset_gradients = [gradient.assign(tf.zeros_like(gradient)) for gradient in
                   accumulated_gradients]
# compute the gradients
gradients = optimizer.compute_gradients(loss, trainable_variables)
# Note: gradients is a list of tuples containing the gradient and the
# corresponding variable, so gradient[0] is the actual gradient. Also divide
# the gradients by BATCHES_PER_STEP so the learning rate still refers to
# steps, not batches.
# define operation to evaluate a batch and accumulate the gradients
evaluate_batch = [
    accumulated_gradient.assign_add(gradient[0] / BATCHES_PER_STEP)
    for accumulated_gradient, gradient in zip(accumulated_gradients,
                                              gradients)]
# define operation to apply the gradients
apply_gradients = optimizer.apply_gradients([
    (accumulated_gradient, gradient[1]) for accumulated_gradient, gradient
    in zip(accumulated_gradients, gradients)])
# define variable and operations to track the average batch loss
average_loss = tf.Variable(0., trainable=False)
update_loss = average_loss.assign_add(loss / BATCHES_PER_STEP)
reset_loss = average_loss.assign(0.)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())
    for step in range(1, MAX_STEPS + 1):
        # data holds the BATCHES_PER_STEP small batches for this step
        data = [...]
        for batch_data in data:
            session.run([evaluate_batch, update_loss],
                        feed_dict={input: batch_data})
        # apply accumulated gradients
        session.run(apply_gradients)
        # get loss
        loss_value = session.run(average_loss)
        # reset variables for next step
        session.run([reset_gradients, reset_loss])
This should be runnable if you fill in the gaps. However I might have made a mistake while stripping it down and pasting it here. For a runnable example you can take a look into a project I am currently working on myself.
I also want to make clear that this is not the same as evaluating the loss for all the batch data at once, since you average over the gradients. This is especially important when your loss does not work well with low statistics. Take a chi square of histograms for example, calculating the average gradients for a chi square of histograms with low bin counts won't be as good as calculating the gradient on just one histogram with all the bins filled up at once.
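The difference can be checked on a toy problem. The plain-Python sketch below is not part of the TensorFlow code above; the functions grad_mse and grad_stat, the data, and the target value are all made up for illustration. It accumulates gradients over two small batches and compares them with the gradient on all the data at once, first for a loss that is a mean over examples and then for a loss of a batch statistic.

```python
def grad_mse(w, xs, ys):
    """d/dw of mean((w*x - y)**2): a loss that is a mean over examples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def grad_stat(w, xs, target):
    """d/dw of (mean(w*x) - target)**2: a loss of a whole-batch statistic."""
    mean_x = sum(xs) / len(xs)
    return 2.0 * (w * mean_x - target) * mean_x


BATCHES_PER_STEP = 2
w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
small_batches = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]

# Accumulate gradients batch by batch, scaled by 1 / BATCHES_PER_STEP.
acc_mse = 0.0
acc_stat = 0.0
for batch_x, batch_y in small_batches:
    acc_mse += grad_mse(w, batch_x, batch_y) / BATCHES_PER_STEP
    acc_stat += grad_stat(w, batch_x, target=5.0) / BATCHES_PER_STEP

# Gradients computed on all the data at once.
full_mse = grad_mse(w, xs, ys)
full_stat = grad_stat(w, xs, target=5.0)

print('mean-over-examples loss:', acc_mse, 'vs', full_mse)
print('batch-statistic loss:   ', acc_stat, 'vs', full_stat)
```

For the mean-over-examples loss the accumulated and full-batch gradients agree; for the statistic-based loss they do not, which is the situation the histogram example describes.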
You would need to give the gradients as the values that get passed to apply_gradients. It can be placeholders, but it is probably easier to use the usual compute_gradients/apply_gradients combination:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# On training
# Compute some gradients
gradient_values = session.run(gradient_tensors, feed_dict={...})
# gradient_values is a sequence of numpy arrays with gradients
# After averaging multiple evaluations of gradient_values apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
If you want to compute the averages of the gradients within TensorFlow too, that requires a bit of extra code specifically for that, maybe something like this:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# Additional operations for gradient averaging
gradient_placeholders = [tf.placeholder(t.dtype, [None] + t.shape.as_list())
                         for t in gradient_tensors]
gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
# On training
gradient_values = None
# Compute some gradients
for ...:  # Repeat for each small batch
    gradient_values_current = session.run(gradient_tensors, feed_dict={...})
    if gradient_values is None:
        gradient_values = [[g] for g in gradient_values_current]
    else:
        for g_list, g in zip(gradient_values, gradient_values_current):
            g_list.append(g)
# Stack gradients
gradient_values = [np.stack(g_list) for g_list in gradient_values]
# Compute averages
gradient_values_average = session.run(
    gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
# After averaging multiple gradients apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
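If the bookkeeping in the loop above is hard to follow, here is the same "collect per batch, then average per variable" step as a plain-Python sketch. It uses flat lists of floats in place of numpy arrays, and average_gradients is a hypothetical helper, not a TensorFlow function.

```python
def average_gradients(per_batch_grads):
    """Average a list of per-batch gradient lists, variable by variable.

    per_batch_grads[b][v] is the gradient (a flat list of floats) of
    variable v computed on small batch b.
    """
    num_batches = len(per_batch_grads)
    averaged = []
    # zip(*...) regroups the values so that all batches of one variable
    # end up together, like stacking along a new first axis.
    for grads_for_var in zip(*per_batch_grads):
        averaged.append([sum(vals) / num_batches for vals in zip(*grads_for_var)])
    return averaged


# Three small batches, two variables (one with 2 elements, one with 1).
per_batch_grads = [
    [[1.0, 2.0], [10.0]],
    [[3.0, 4.0], [20.0]],
    [[5.0, 6.0], [30.0]],
]
averaged = average_gradients(per_batch_grads)
print(averaged)  # [[3.0, 4.0], [20.0]]
```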

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, is it possible to loop an optimizer node over a full minibatched epoch within a single call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations meaning that the entire graph can be placed in the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when those they depend on are evaluated, particularly given the thin official documentation and limited Stack Overflow coverage.
What the tf.while_loop documentation leaves out is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that the optimization, model, and loss have to be defined in the loop. This limits flexibility if you'd like to, e.g., call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict, but that is not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Shuffling at each epoch doesn't seem to be available on GPU.
Here's my schematic code (I've omitted the variable and model definitions for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')
        # Initializers for loop variables for tf.while_loop
        batchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)
        # In a full example, I'd normalize my data here. And possibly shuffle
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)
        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly (like with tf.apply)
        # they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.
        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)
        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're not updated as tf.while_loop loops. This means
            # slices, model definition, loss tensor, and training op.
            # The slices use the loop variable batchCount so that they advance on every iteration.
            dat_batch = tf.slice(tf_training_data, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            # Here's where you'd define the model as a function of weights and biases above and dat_batch
            # model = <insert here>
            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.01)  # for example; the learning rate is an arbitrary value
            train_op = optimizer.minimize(loss, name='optimizer')
            # control_dependencies ensures that train_op is run before return
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)
        # So, the idea is that this trains a full epoch without returning to Python.
        trainMinibatches = tf.while_loop(moreMinibatches, runMinibatch, [batchCounter, lossList],
                                         shape_invariants=[batchCounter.get_shape(), tf.TensorShape(None)])
    return (graph,
            {'trainMinibatches': trainMinibatches,
             'minibatchCounter': batchCounter,
             'norm_loss': norm_loss,  # assumed to come from the omitted model code
             })
numEpochs = 100  # e.g.
minibatchSize = 32  #
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)
with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op will train on all minibatches that fit in the full dataset. finalBatchCount will be the number of
        # complete minibatches in the dataset. lossList is a list of each minibatch's loss.
        finalBatchCount, lossList = session.run(ops['trainMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented the tf.slice() and tf.while_loop approach suggested above to vectorize the mini-batch loop.
In my case it was about 1.86 times faster than feeding mini-batches through feed_dict, but I found that the loss values did not stabilize from epoch to epoch.
After I changed it to tf.random_shuffle the inputs every epoch, the problem was much reduced (and the performance gain dropped to 1.68 times).
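As a framework-free illustration of the two ideas combined (reshuffle once per epoch, then slice out only complete minibatches), here is a plain-Python sketch. The generator epoch_minibatches and its arguments are hypothetical names, and Python's random module stands in for tf.random_shuffle.

```python
import random


def epoch_minibatches(data, batch_size, rng):
    """Yield the complete minibatches of one epoch after shuffling.

    Uses the same loop condition as the while_loop version,
    (batch_count + 1) * batch_size <= len(data), so a trailing
    partial batch is dropped.
    """
    order = list(range(len(data)))
    rng.shuffle(order)  # new sample order for this epoch
    batch_count = 0
    while (batch_count + 1) * batch_size <= len(data):
        start = batch_count * batch_size
        yield [data[i] for i in order[start:start + batch_size]]
        batch_count += 1


data = list(range(100))
rng = random.Random(0)  # fixed seed only to make the run repeatable
epoch = list(epoch_minibatches(data, batch_size=32, rng=rng))
print(len(epoch), [len(batch) for batch in epoch])
```

With 100 samples and a batch size of 32, each epoch yields 3 batches and leaves out 4 samples, and which 4 are left out changes from epoch to epoch because of the shuffle.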

rationale behind the evaluation in tensorflow's tutorial code cifar10_eval.py

In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
  with tf.Graph().as_default() as g:
    # Get images and labels for CIFAR-10.
    eval_data = FLAGS.eval_data == 'test'
    images, labels = cifar10.inputs(eval_data=eval_data)
    # Build a Graph that computes the logits predictions from the
    # inference model.
    logits = cifar10.inference(images)
    # Calculate predictions.
    top_k_op = tf.nn.in_top_k(logits, labels, 1)
    # Restore the moving average version of the learned variables for eval.
    variable_averages = tf.train.ExponentialMovingAverage(
        cifar10.MOVING_AVERAGE_DECAY)
    variables_to_restore = variable_averages.variables_to_restore()
    saver = tf.train.Saver(variables_to_restore)
    # Build the summary operation based on the TF collection of Summaries.
    summary_op = tf.summary.merge_all()
    summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)
    while True:
      eval_once(saver, summary_writer, top_k_op, summary_op)
      if FLAGS.run_once:
        break
      time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints 'precision' to the console every eval_interval_secs. My questions are:
1. Each time eval_once() is executed, one batch of samples (128) is dequeued from the data queue, so why doesn't the evaluation stop after enough batches, 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
2. Are the batches from the first 79 samplings mutually exclusive? I'd assume so but want to double-check.
3. If each batch is indeed dequeued from the data queue, what are the samples after 79 rounds of sampling? Random samples from the entire duplicated data queue again?
4. Since in_top_k() takes unnormalized logit values and outputs a vector of booleans, it masks the internal conversion of softmax() + thresholding. Is there a TF op for such explicit computations? Ideally, it would be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in "inputs" def of cifar10_input.py
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer :
string_input_producer(
string_tensor,
num_epochs=None,
shuffle=True,
seed=None,
capacity=32,
shared_name=None,
name=None,
cancel_op=None
)
num_epochs : produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
In our case, num_epochs is not specified. That's why it does not stop after few batches. It can run unlimited times.
By default, the shuffle option is set to True in tf.train.string_input_producer. So it shuffles the data first and then repeats those shuffled 10K filenames again and again.
Therefore, the batches within one pass are mutually exclusive. You can print the filenames to see this.
As explained in 1, they are repeated samples (not random data).
You could avoid using tf.nn.in_top_k. Use tf.nn.softmax and tf.greater_equal to obtain boolean tensor that has softmax value above the specific threshold.
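To show what that combination computes, here is a plain-Python sketch of softmax followed by a greater-or-equal threshold, written without TensorFlow. The helper names softmax and above_threshold and the example logits are made up for illustration; in a graph you would use the TF ops named above instead.

```python
import math


def softmax(logits):
    """Convert unnormalized logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def above_threshold(logits, threshold):
    """Boolean mask of classes whose softmax probability >= threshold."""
    return [p >= threshold for p in softmax(logits)]


logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])
print(above_threshold(logits, 0.5))
print(above_threshold(logits, 0.2))
```

Lowering the threshold lets more classes through, which is the tuning knob that in_top_k hides.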
I hope this helps. Please comment if there is any misunderstanding.