How to average summaries over multiple batches? - tensorflow

Assuming I have a bunch of summaries defined like:
loss = ...
tf.scalar_summary("loss", loss)
# ...
summaries = tf.merge_all_summaries()
I can evaluate the summaries tensor every few steps on the training data and pass the result to a SummaryWriter.
The result will be noisy summaries, because they're only computed on one batch.
However, I would like to compute the summaries on the entire validation dataset.
Of course, I can't pass the validation dataset as a single batch, because it would be too big.
So, I'll get summary outputs for each validation batch.
Is there a way to average those summaries so that it appears as if the summaries have been computed on the entire validation set?

Do the averaging of your measure in Python and create a new Summary object for each mean. Here is what I do:
accuracies = []
# Calculate your measure over as many batches as you need
for batch in validation_set:
    # Run the op that computes your measure on this batch (not the training op)
    accuracies.append(sess.run(accuracy_op))
# Take the mean of your measure
accuracy = np.mean(accuracies)
# Create a new Summary object with your measure
summary = tf.Summary()
summary.value.add(tag="%sAccuracy" % prefix, simple_value=accuracy)
# Add it to the Tensorboard summary writer
# Make sure to specify a step parameter to get nice graphs over time
summary_writer.add_summary(summary, global_step)
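One caveat with taking a plain mean of per-batch values: if the last validation batch is smaller than the others, it still counts as much as a full batch, so the result is not quite the mean over the whole set. Here is a minimal sketch of a size-weighted mean, written in plain Python so it runs without TensorFlow. The helper name weighted_mean and the numbers are made up for illustration.

```python
def weighted_mean(batch_means, batch_sizes):
    """Average per-batch means, weighting each batch by its number of examples."""
    if len(batch_means) != len(batch_sizes):
        raise ValueError("batch_means and batch_sizes must have the same length")
    count = sum(batch_sizes)
    if count == 0:
        raise ValueError("no examples seen")
    total = sum(m * n for m, n in zip(batch_means, batch_sizes))
    return total / count


# Hypothetical values: two full batches of 100 examples and a final batch of 10.
batch_means = [0.90, 0.80, 0.10]
batch_sizes = [100, 100, 10]

plain = sum(batch_means) / len(batch_means)        # treats all batches equally
weighted = weighted_mean(batch_means, batch_sizes)  # matches the per-example mean
```

The weighted value is the one you would pass as simple_value if you want the plot to reflect the whole validation set.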

I would avoid calculating the average outside the graph.
You can use tf.train.ExponentialMovingAverage:
ema = tf.train.ExponentialMovingAverage(decay=my_decay_value, zero_debias=True)
maintain_ema_op = ema.apply(your_losses_list)
# Create an op that will update the moving averages after each training step.
with tf.control_dependencies([your_original_train_op]):
    train_op = tf.group(maintain_ema_op)
Then, use:
sess.run(train_op)
That will call maintain_ema_op because it is defined as a control dependency.
In order to get your exponential moving averages, use:
moving_average = ema.average(an_item_from_your_losses_list_above)
And retrieve its value using:
value = sess.run(moving_average)
This calculates the moving average within your calculation graph.
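Keep in mind that an exponential moving average weights recent batches more heavily, so it is a smoothed estimate rather than the exact mean over a validation set. To show the arithmetic, here is a plain-Python sketch of a zero-initialized moving average with bias correction. The class DebiasedEMA is hypothetical, and the correction factor 1 - decay**t is my assumption about what zero_debias is meant to achieve; TensorFlow's internals may differ in detail.

```python
class DebiasedEMA:
    """Exponential moving average with a zero start and bias correction."""

    def __init__(self, decay):
        if not 0.0 <= decay < 1.0:
            raise ValueError("decay must be in [0, 1)")
        self.decay = decay
        self.biased = 0.0  # raw accumulator, biased towards zero early on
        self.steps = 0

    def update(self, value):
        self.steps += 1
        self.biased = self.decay * self.biased + (1.0 - self.decay) * value
        return self.average()

    def average(self):
        if self.steps == 0:
            raise ValueError("no values seen yet")
        # Divide out the weight that is still sitting on the zero initial value.
        return self.biased / (1.0 - self.decay ** self.steps)


ema = DebiasedEMA(decay=0.9)
history = [ema.update(loss) for loss in [2.0, 2.0, 2.0]]
```

With a constant input the corrected average equals that constant from the very first step, which is the point of the debiasing.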

I think it's always better to let tensorflow do the calculations.
Have a look at the streaming metrics. They have an update function to feed the information of your current batch and a function to get the averaged summary.
It's going to look somewhat like this:
accuracy = ...
streaming_accuracy, streaming_accuracy_update = tf.contrib.metrics.streaming_mean(accuracy)
streaming_accuracy_scalar = tf.summary.scalar('streaming_accuracy', streaming_accuracy)
# set up your session etc.
for i in iterations:
    for b in batches:
        sess.run([streaming_accuracy_update], feed_dict={...})
    streaming_summary = sess.run(streaming_accuracy_scalar)
    writer.add_summary(streaming_summary, i)
Also see the tensorflow documentation: https://www.tensorflow.org/versions/master/api_guides/python/contrib.metrics
and this question:
How to accumulate summary statistics in tensorflow

You can store the current sum and recalculate the average after each batch, like:
loss_sum = tf.Variable(0.)
inc_op = tf.assign_add(loss_sum, loss)
clear_op = tf.assign(loss_sum, 0.)
average = loss_sum / batches
tf.scalar_summary("average_loss", average)
sess.run(clear_op)
for i in range(batches):
    sess.run([loss, inc_op])
sess.run(average)

For future reference, the TensorFlow metrics API now supports this by default. For example, take a look at tf.metrics.mean_squared_error:
For estimation of the metric over a stream of data, the function creates an update_op operation that updates these variables and returns the mean_squared_error. Internally, a squared_error operation computes the element-wise square of the difference between predictions and labels. Then update_op increments total with the reduced sum of the product of weights and squared_error, and it increments count with the reduced sum of weights.
These total and count variables are added to the set of metric variables, so in practice what you would do is something like:
x_batch = tf.placeholder(...)
y_batch = tf.placeholder(...)
model_output = ...
mse, mse_update = tf.metrics.mean_squared_error(y_batch, model_output)
# This operation resets the metric internal variables to zero
metrics_init = tf.variables_initializer(
    tf.get_default_graph().get_collection(tf.GraphKeys.METRIC_VARIABLES))
with tf.Session() as sess:
    # Train...
    # On evaluation step
    sess.run(metrics_init)
    for x_eval_batch, y_eval_batch in ...:
        # Use a separate name so the mse tensor is not overwritten
        mse_value = sess.run(mse_update, feed_dict={x_batch: x_eval_batch, y_batch: y_eval_batch})
    print('Evaluation MSE:', mse_value)
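To make the total and count bookkeeping from the quoted documentation concrete, here is a plain-Python sketch of the same idea. StreamingMSE is a hypothetical stand-in, not the TensorFlow implementation, and returning 0.0 for an empty accumulator is an assumption of this sketch.

```python
class StreamingMSE:
    """Accumulates a weighted mean squared error across batches."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Plays the role of re-initializing the metric variables.
        self.total = 0.0
        self.count = 0.0

    def update(self, labels, predictions, weights=None):
        if weights is None:
            weights = [1.0] * len(labels)
        for y, p, w in zip(labels, predictions, weights):
            self.total += w * (y - p) ** 2  # weighted squared error
            self.count += w                 # sum of weights
        return self.result()

    def result(self):
        return self.total / self.count if self.count > 0 else 0.0


metric = StreamingMSE()
first = metric.update([1.0, 2.0], [1.0, 4.0])  # squared errors 0 and 4
second = metric.update([0.0], [3.0])           # squared error 9
```

After both updates the result is 13/3, the same value you would get by computing the MSE over all three examples at once, even though the batches have different sizes.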

I found one solution myself. I think it's kind of hacky and I hope there is a more elegant solution.
During setup:
valid_loss_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
valid_loss_summary = tf.scalar_summary("valid loss", valid_loss_placeholder)
Or for tensorflow versions after 0.12 (change in name for tf.scalar_summary):
valid_loss_placeholder = tf.placeholder(dtype=tf.float32, shape=[])
valid_loss_summary = tf.summary.scalar("valid loss", valid_loss_placeholder)
Within training loop:
# Compute valid loss in python by doing sess.run() for each batch
# and averaging
valid_loss = ...
summary = sess.run(valid_loss_summary, {valid_loss_placeholder: valid_loss})
summary_writer.add_summary(summary, step)

As of August 2018, streaming metrics have been deprecated. However, unintuitively, all metrics are streaming. So, use tf.metrics.accuracy.
However, if you want accuracy (or another metric) over only a subset of batches, then you can use an exponential moving average, as in the answer by @MZHm, or reset any of the tf.metrics by following this very informative blog post.

For quite some time I had only been saving the summary once per epoch. I never knew that TensorFlow's summary would then only save the summary for the last batch that was run.
Shocked, I looked into this problem. This is the solution I came up with (using the dataset API):
loss = ...
train_op = ...
loss_metric, loss_metric_update = tf.metrics.mean(loss)
tf.summary.scalar('loss', loss_metric)
merged = tf.summary.merge_all()
train_writer = tf.summary.FileWriter(os.path.join(res_dir, 'train'))
test_writer = tf.summary.FileWriter(os.path.join(res_dir, 'test'))
init_local = tf.initializers.local_variables()
init_global = tf.initializers.global_variables()
sess.run(init_global)
def train_run(epoch):
    sess.run([dataset.train_init_op, init_local])  # train_init_op is the operation that switches to training data
    for i in range(dataset.num_train_batches):  # num_train_batches is the number of batches that should be run for the training set
        sess.run([train_op, loss_metric_update])
    summary, cur_loss = sess.run([merged, loss_metric])
    train_writer.add_summary(summary, epoch)
    return cur_loss
def test_run(epoch):
    sess.run([dataset.test_init_op, init_local])  # test_init_op is the operation that switches to test data
    for i in range(dataset.num_test_batches):  # num_test_batches is the number of batches that should be run for the test set
        sess.run(loss_metric_update)
    summary, cur_loss = sess.run([merged, loss_metric])
    test_writer.add_summary(summary, epoch)
    return cur_loss
for epoch in range(epochs):
    train_loss = train_run(epoch+1)
    test_loss = test_run(epoch+1)
    print("Epoch: {0:3}, loss: (train: {1:10.10f}, test: {2:10.10f})".format(epoch+1, train_loss, test_loss))
For the summary I'm just wrapping the tensor I'm interested in into tf.metrics.mean(). For each batch run I call the metrics update operation. At the end of every epoch the metrics tensor will return the correct mean of all batch results.
Don't forget to initialize local variables every time you switch between training and test data. Otherwise your train and test metrics will be near identical.

I had the same problem when I realized I had to iterate over my validation data, because memory got tight and the OOM errors started flooding in.
As several of these answers say, tf.metrics has this built in, but I'm not using tf.metrics in my project. So, inspired by that, I made this:
import tensorflow as tf
import numpy as np
def batch_persistent_mean(tensor):
    # Make a variable that keeps track of the sum
    accumulator = tf.Variable(initial_value=tf.zeros_like(tensor), dtype=tf.float32)
    # Keep count of batches in accumulator (needed to estimate mean)
    batch_nums = tf.Variable(initial_value=tf.zeros_like(tensor), dtype=tf.float32)
    # Make an operation for accumulating, increasing batch count
    accumulate_op = tf.assign_add(accumulator, tensor)
    step_batch = tf.assign_add(batch_nums, 1)
    update_op = tf.group([step_batch, accumulate_op])
    eps = 1e-5
    output_tensor = accumulator / (tf.nn.relu(batch_nums - eps) + eps)
    # In regards to the tf.nn.relu, it's a hacky zero_guard:
    # if batch_nums are zero then return eps, else it'll be batch_nums
    # Make an operation to reset
    flush_op = tf.group([tf.assign(accumulator, 0), tf.assign(batch_nums, 0)])
    return output_tensor, update_op, flush_op
# Make a variable that we want to accumulate
X = tf.Variable(0., dtype=tf.float32)
# Make our persistent mean operations
Xbar, upd, flush = batch_persistent_mean(X)
Now you send Xbar to your summary e.g. tf.scalar_summary("mean_of_x", Xbar), and where you'd do sess.run(X) before, you'll do sess.run(upd). And between epochs you'd do sess.run(flush).
Testing behaviour:
### INSERT ABOVE CODE CHUNK IN S.O. ANSWER HERE ###
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    # Calculate the mean of 0+1+...+19
    for i in range(20):
        sess.run(upd, {X: i})
    print(sess.run(Xbar), "=", np.mean(np.arange(20)))
    for i in range(40):
        sess.run(upd, {X: i})
    # Now Xbar is the mean of (0+1+...+19+0+1+...+39):
    print(sess.run(Xbar), "=", np.mean(np.concatenate([np.arange(20), np.arange(40)])))
    # Now flush it
    sess.run(flush)
    print("flushed. Xbar=", sess.run(Xbar))
    for i in range(40):
        sess.run(upd, {X: i})
    print(sess.run(Xbar), "=", np.mean(np.arange(40)))

Related

Training runs out of memory as RAM consumption keeps growing

I am not sure since when I have been having this issue. I have to believe it started at some point between today and a few months ago, but it would seem that the RAM (CPU) consumption grows over time during epochs.
self.model.fit(
    train_data,
    initial_epoch=self.status.valid_last.epoch,
    epochs=train_config.epochs,
    steps_per_epoch=train_config.steps_per_epoch,
    callbacks=self._get_experiment_callbacks(),
    validation_data=valid_data,
    validation_steps=train_config.validation_steps,
)
The only thing out of the ordinary here might be the callbacks I am passing but there's actually nothing special here. One is a TensorBoard (TB) callback and the other is a custom Metric which is not doing much except plotting the learning rate and other general metrics to TB.
def _get_experiment_callbacks(self) -> List[tf.keras.callbacks.Callback]:
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        log_dir=os.path.join(out_dir, "logs"),
        update_freq="epoch",
        profile_batch=profile_batch,
        write_images=True,
    )
    # Not interested in whatever is plotted in those
    tensorboard_cb.on_epoch_end = lambda *args: ...
    tensorboard_cb.on_test_end = lambda *args: ...
    return [
        tensorboard_cb,
        Metrics(tensorboard_cb, update_freq=100),
    ]
This leaves us with the last suspect which is the valid_data itself. This is essentially just a list of protobuf files (shards) which I am loading like so:
def load_shards(
    decode_example_fn: Callable,
    shard_fps: List[str],
    training: bool,
    buffer_size: int = None  # 50 * 1000 ** 2,
) -> tf.data.Dataset:
    if not len(shard_fps) > 0:
        raise ValueError("Argument shard_fps must be a list to shards but is empty.")
    def make_dense_(example):
        for k, v in example.items():
            if isinstance(v, tf.SparseTensor):
                example[k] = tf.sparse.to_dense(v)
        return example
    def load_records_(filenames):
        record_dataset = tf.data.TFRecordDataset(filenames, buffer_size=buffer_size)
        record_dataset = record_dataset.map(decode_example_fn)
        record_dataset = record_dataset.map(make_dense_)
        return record_dataset
    if not training:
        shard_fps = sorted(shard_fps)
    dataset = tf.data.Dataset.from_tensor_slices(tf.constant(shard_fps))
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)
    if training:
        dataset = dataset.interleave(load_records_, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    else:
        dataset = dataset.apply(load_records_)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset
and from then on there's just preprocessing and transformation mappings on the inputs, so I would not expect any memory leak at this point.
Still, I am observing a continuous increase of memory consumption over time. The screenshot below shows the consumption after a restart.
At first we use ~28GB of RAM. After 100 steps there's a sharp increase, to ~33GB and from there it kind of seems to stabilize at around 38GB. The next big jump at 216k steps is coming from an evaluation. From there it's just constantly growing ..
From the looks it appears as if the memory usage stabilized and the jump only occurs after each epoch (1 epoch = 6000 steps).
There could be any number of things that could be wrong. TensorBoard could possibly not be reusing the same graph, but instead is adding graphs, which leads to OOM. I don't use TensorBoard myself because I remember this as happening to me a few years back. It's also possible that using model.fit is the problem and that you're loading your data at every epoch. You could try writing the training loop something like:
for epoch in tf.range(epochs):
    batch_train_loss = []
    batch_train_acc = []
    for batch, (X, Y) in train_dataset.enumerate():
        train_loss = train_fn(X, Y, model, loss, optimizer, metric, batch)  # do the actual training
        train_acc = metric.result().numpy()  # get the training accuracy
        batch_train_loss.append(train_loss)  # save the training loss above
        batch_train_acc.append(train_acc)  # save the training accuracy above
        metric.reset_states()  # reset the metric after every batch
where the train_fn is:
def get_apply_train_fn():
    @tf.function
    def train_function(X, Y, model, loss, optimizer, metric, step):
        with tf.GradientTape() as tape:
            predictions = model(X, training=True)
            loss_value = loss(Y, predictions)
        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        train_acc = metric.update_state(Y, predictions)
        return loss_value
    return train_function
train_fn = get_apply_train_fn()
Now, this is a stupidly complicated way of writing model.fit, but it does work.
Another way in which I've had to combat OOM on GPU side is use Python's multiprocessing, but this was in a context where I was doing 10-fold cross-validation and the training would crash after 7 or 8 folds with OOM.
Alternatively, you could try turning eager execution on or off with
tf.config.run_functions_eagerly(False) # or True
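Before rewriting the training loop, it may help to check whether the growth comes from Python-side objects at all. Here is a minimal sketch using the standard library's tracemalloc to compare snapshots taken before and after a few epochs. It only sees allocations made through Python's allocator, so memory held natively by TensorFlow will not show up. The helper top_growth and the leaky list are made up for illustration; in a real run the loop body would be your epoch.

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return (source line, bytes grown) for the largest changes between snapshots."""
    stats = after.compare_to(before, "lineno")
    return [(str(stat.traceback[0]), stat.size_diff) for stat in stats[:limit]]


tracemalloc.start()
leaky = []
before = tracemalloc.take_snapshot()

for epoch in range(3):
    # Stand-in for state that accumulates once per epoch (e.g. a callback cache).
    leaky.append([0] * 100_000)

after = tracemalloc.take_snapshot()
growth = top_growth(before, after)
tracemalloc.stop()

for location, size_diff in growth:
    print(location, size_diff)
```

If the top entries point into a callback or the input pipeline wrapper, that narrows the search. If nothing here grows, the leak is more likely on the native side.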

how to track percentage complete and average training iteration runtime in tensorboard?

I have what I think should be a simple problem but I can't seem to figure it out.
Let's say that I have something like this
with tf.Session(graph=self.training_graph) as sess:
    init = tf.global_variables_initializer()
    logger.info("initializing global variables")
    sess.run(init)
    # add the operations that distort input images according to the hyperparameters
    self._setup_meta_training_tensors()
    self._add_jpeg_decoding()
    self._add_input_distortions()
    evaluation_step, prediction = self._add_evaluation_step(
        self.train_final_tensor, self.train_ground_truth_input)
    self.merged = tf.summary.merge_all()
    self.train_writer = tf.summary.FileWriter(os.path.join(
        self.model.tensorboard_directory, 'train/'), sess.graph)
    self.validation_writer = tf.summary.FileWriter(os.path.join(
        self.model.tensorboard_directory, 'validation/'))
    self.train_saver = tf.train.Saver()
    for step in range(self.training_steps):
        start = time.time()
        train_bottlenecks, train_ground_truth = (
            self._get_random_distorted_bottlenecks(sess,
                                                   self.training_batch_size,
                                                   self.IMAGE_CATEGORY_TRAINING,
                                                   self.train_bottleneck_tensor,
                                                   self.train_resized_input_tensor))
        # Feed the bottlenecks and ground truth into the graph, and run a training
        # step. Capture training summaries for TensorBoard with the `merged` op.
        train_summary, _ = sess.run(
            [self.merged, self.train_step],
            feed_dict={self.train_bottleneck_input: train_bottlenecks,
                       self.train_ground_truth_input: train_ground_truth})
        train_time = time.time() - start
        self.train_writer.add_summary(train_summary, step)
        is_last_step = (step + 1 == self.training_steps)
        if (step % self.eval_step_interval) == 0 or is_last_step:
            train_accuracy, cross_entropy_value = sess.run(
                [evaluation_step, self.cross_entropy],
                feed_dict={self.train_bottleneck_input: train_bottlenecks,
                           self.train_ground_truth_input: train_ground_truth})
            validation_bottlenecks, validation_ground_truth, _ = (
                self._get_random_bottlenecks(sess,
                                             self.validation_batch_size,
                                             self.IMAGE_CATEGORY_VALIDATION,
                                             self.train_bottleneck_tensor,
                                             self.train_resized_input_tensor))
            validation_summary, validation_accuracy = sess.run(
                [self.merged, evaluation_step],
                feed_dict={self.train_bottleneck_input: validation_bottlenecks,
                           self.train_ground_truth_input: validation_ground_truth})
            self.validation_writer.add_summary(validation_summary, step)
Now my tensorboard is tracking all sorts of variables relating to the self.training_graph - accuracy, cross entropy, information about the weights and what not.
All I want to do is have another graph on tensorboard that tracks the average runtime of each training step. If I time the step (see train_time), how do I put these into an ever-increasing array and show it in tensorboard for this graph?
The issue seems to be that these values aren't a part of my main model graph; they're different values. If I make them with a new graph that simply appends new runtimes, then they don't show up in tensorboard. I could make them a part of the graph, but that seems dumb. Why would my complicated ML graph have a random part that calculates the average training iteration runtime?
I would use a helper library like https://github.com/lanpa/tensorboardX which abstracts away an annoying additional session call.
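Whichever writer you end up using, the bookkeeping itself can stay outside the model graph. Here is a minimal sketch in plain Python that tracks the average step runtime and the percentage complete. StepTimer and log_scalar are hypothetical names; log_scalar just collects values here and stands in for whatever actually writes a scalar (for example a manually built tf.Summary with simple_value passed to train_writer.add_summary, or a tensorboardX writer).

```python
import time


class StepTimer:
    """Tracks average step runtime and progress, independent of any TF graph."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.steps_done = 0
        self.total_seconds = 0.0

    def record(self, seconds):
        self.steps_done += 1
        self.total_seconds += seconds

    @property
    def average_seconds(self):
        return self.total_seconds / self.steps_done if self.steps_done else 0.0

    @property
    def percent_complete(self):
        return 100.0 * self.steps_done / self.total_steps


logged = []


def log_scalar(tag, value, step):
    # Hypothetical stand-in for the real summary writer call.
    logged.append((tag, value, step))


timer = StepTimer(total_steps=4)
for step in range(4):
    start = time.perf_counter()
    sum(range(1000))  # stand-in for sess.run(self.train_step, ...)
    timer.record(time.perf_counter() - start)
    log_scalar("avg_step_seconds", timer.average_seconds, step)
    log_scalar("percent_complete", timer.percent_complete, step)
```

This keeps a running total rather than an ever-growing array, which is enough for an average.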

How to feed the list of gradients, or (grad, variable name) pairs, to my model

This is related to a previous question: How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; but, still I couldn't find the exact answer. For example, the answer to another related question tensorflow - run optimizer op on a large batch doesn't work for me (btw. it wasn't accepted and there are no more comments there).
I want to try to simulate larger batch size but using only one GPU.
So, I need to compute the gradients for every smaller batch, aggregate/average them across several such smaller batches, and only then apply.
(Basically, it's like synchronized distributed SGD, but on a single device/GPU, performed serially. Of course, the acceleration advantage of distributed SGD is lost but larger batch size itself will maybe enable convergence to larger accuracy and larger step size, as indicated by a few recent papers.)
To keep memory requirement low, I should do standard SGD with small batches, update the gradients after every iteration and then call optimizer.apply_gradients() (where optimizer is one of the implemented optimizers).
So, everything looks simple but when I go to implement it, it is actually not so trivial.
For example, I would like to use one Graph, compute gradients for each iteration, and then, when multiple batches are processed, sum the gradients and pass them to my model. But the list itself can't be fed into the feed_dict parameter of sess.run. Also, passing gradients directly doesn't exactly work, I get the TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I can't pass in the numpy.ndarray, only tensorflow variable).
I could define a placeholder for the gradients, but for that I would need to build the model first (to specify the trainable variables etc.).
All in all, please tell me there is a simpler way to implement this.
There is no simpler way than what you have already been told. That way may seem complicated at first, but it actually is really simple. You just have to use the low-level API to manually calculate the gradients for each batch, average over them and then manually feed the averaged gradients to the optimizer to apply them.
I'll try to provide some stripped down code of how to do this. I'll use dots as placeholders for actual code which would depend on the problem. What you would usually do would be something like this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# define operation to apply the gradients
minimize = optimizer.minimize(loss)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())
    for step in range(1, MAX_STEPS + 1):
        data = ...
        # Use a separate name so the loss tensor is not overwritten
        loss_value = session.run([minimize, loss],
                                 feed_dict={input: data})[1]
What you want to do instead, to average over multiple batches and save memory, would be this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# grab all trainable variables
trainable_variables = tf.trainable_variables()
# define variables to save the gradients in each batch
accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
trainable=False) for tv in
trainable_variables]
# define operation to reset the accumulated gradients to zero
reset_gradients = [gradient.assign(tf.zeros_like(gradient)) for gradient in
accumulated_gradients]
# compute the gradients
gradients = optimizer.compute_gradients(loss, trainable_variables)
# Note: Gradients is a list of tuples containing the gradient and the
# corresponding variable so gradient[0] is the actual gradient. Also divide
# the gradients by BATCHES_PER_STEP so the learning rate still refers to
# steps not batches.
# define operation to evaluate a batch and accumulate the gradients
evaluate_batch = [
accumulated_gradient.assign_add(gradient[0]/BATCHES_PER_STEP)
for accumulated_gradient, gradient in zip(accumulated_gradients,
gradients)]
# define operation to apply the gradients
apply_gradients = optimizer.apply_gradients([
(accumulated_gradient, gradient[1]) for accumulated_gradient, gradient
in zip(accumulated_gradients, gradients)])
# define variable and operations to track the average batch loss
average_loss = tf.Variable(0., trainable=False)
update_loss = average_loss.assign_add(loss/BATCHES_PER_STEP)
reset_loss = average_loss.assign(0.)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())
    for step in range(1, MAX_STEPS + 1):
        # a list of BATCHES_PER_STEP small batches for this step
        data = ...
        for batch_data in data:
            session.run([evaluate_batch, update_loss],
                        feed_dict={input: batch_data})
        # apply accumulated gradients
        session.run(apply_gradients)
        # get loss (separate name so the loss tensor is not overwritten)
        loss_value = session.run(average_loss)
        # reset variables for next step
        session.run([reset_gradients, reset_loss])
This should be runnable if you fill in the gaps. However I might have made a mistake while stripping it down and pasting it here. For a runnable example you can take a look into a project I am currently working on myself.
I also want to make clear that this is not the same as evaluating the loss for all the batch data at once, since you average over the gradients. This is especially important when your loss does not work well with low statistics. Take a chi square of histograms for example, calculating the average gradients for a chi square of histograms with low bin counts won't be as good as calculating the gradient on just one histogram with all the bins filled up at once.
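For the common case where the loss is a plain mean over examples and the small batches all have the same size, averaging the per-batch gradients does give the same gradient as one large batch. Here is a small plain-Python check of that, using a made-up one-parameter least-squares model; mse_grad and the data are for illustration only.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 9.0]
w = 1.5

# Gradient on the full batch of four examples.
full = mse_grad(w, xs, ys)

# Same data split into two equally sized small batches.
batches = [(xs[0:2], ys[0:2]), (xs[2:4], ys[2:4])]
accumulated = 0.0
for bx, by in batches:
    # Dividing by the number of batches mirrors gradient[0]/BATCHES_PER_STEP.
    accumulated += mse_grad(w, bx, by) / len(batches)
```

Both values come out as -9.5 here. For losses that are not simple per-example means, such as the chi-square example, this equality no longer holds.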
You would need to give the gradients as the values that get passed to apply_gradients. It can be placeholders, but it is probably easier to use the usual compute_gradients/apply_gradients combination:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# On training
# Compute some gradients
gradient_values = session.run(gradient_tensors, feed_dict={...})
# gradient_values is a sequence of numpy arrays with gradients
# After averaging multiple evaluations of gradient_values apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
If you want to compute the averages of the gradients within TensorFlow too, that requires a bit of extra code specifically for that, maybe something like this:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of (gradient, variable) pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# Additional operations for gradient averaging
gradient_placeholders = [tf.placeholder(t.dtype, [None] + t.shape.as_list())
                         for t in gradient_tensors]
gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
# On training
gradient_values = None
# Compute some gradients
for ...:  # Repeat for each small batch
    gradient_values_current = session.run(gradient_tensors, feed_dict={...})
    if gradient_values is None:
        gradient_values = [[g] for g in gradient_values_current]
    else:
        for g_list, g in zip(gradient_values, gradient_values_current):
            g_list.append(g)
# Stack gradients
gradient_values = [np.stack(g_list) for g_list in gradient_values]
# Compute averages
gradient_values_average = session.run(
gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
# After averaging multiple gradients apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))

rationale behind the evaluation in tensorflow's tutorial code cifar10_eval.py

In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
    with tf.Graph().as_default() as g:
        # Get images and labels for CIFAR-10.
        eval_data = FLAGS.eval_data == 'test'
        images, labels = cifar10.inputs(eval_data=eval_data)
        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)
        # Calculate predictions.
        top_k_op = tf.nn.in_top_k(logits, labels, 1)
        # Restore the moving average version of the learned variables for eval.
        variable_averages = tf.train.ExponentialMovingAverage(
            cifar10.MOVING_AVERAGE_DECAY)
        variables_to_restore = variable_averages.variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)
        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.summary.merge_all()
        summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)
        while True:
            eval_once(saver, summary_writer, top_k_op, summary_op)
            if FLAGS.run_once:
                break
            time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints 'precision' to the console every eval_interval_secs. My questions are:
1. Each time eval_once() is executed, one batch of samples (128) is dequeued from the data queue, so why didn't I see the evaluation stop after enough batches, i.e. 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
2. Are the batches from the first 79 samplings mutually exclusive? I'd assume so but want to double-check this.
3. If each batch is indeed dequeued from the data queue, what are the samples after 79 rounds of sampling? Some random sampling from the entire duplicated data queue again?
4. Since in_top_k() takes in unnormalized logit values and outputs a string of booleans, it masks the internal conversions of softmax() + thresholding. Is there a TF op for such explicit computations? Ideally, it'd be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in "inputs" def of cifar10_input.py
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer :
string_input_producer(
    string_tensor,
    num_epochs=None,
    shuffle=True,
    seed=None,
    capacity=32,
    shared_name=None,
    name=None,
    cancel_op=None
)
num_epochs : produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
1. In our case, num_epochs is not specified. That's why it does not stop after a few batches. It can run an unlimited number of times.
2. By default, the shuffle option is set to True in tf.train.string_input_producer. So, it shuffles the data first and copies those shuffled 10K filenames again and again. Therefore, the batches are mutually exclusive. You can print filenames to see this.
3. As explained in 1, they are repeated samples (not any random data).
4. You could avoid using tf.nn.in_top_k. Use tf.nn.softmax and tf.greater_equal to obtain a boolean tensor marking the softmax values at or above the specific threshold.
I hope this helps. Please comment if there is any misunderstanding.
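To illustrate the softmax-plus-threshold idea without running a graph, here is a plain-Python sketch of the arithmetic. The helper names and the logits are made up; in TensorFlow the same two steps would be tf.nn.softmax followed by tf.greater_equal. The last line also checks the batch count from the question.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def above_threshold(logits, threshold):
    """Boolean mask of classes whose softmax probability reaches the threshold."""
    return [p >= threshold for p in softmax(logits)]


logits = [2.0, 1.0, 0.1]  # hypothetical unnormalized scores for three classes
probs = softmax(logits)
mask = above_threshold(logits, threshold=0.5)

# Number of batches of 128 needed to cover 10000 evaluation examples once.
num_batches = math.ceil(10000 / 128)
```

Changing the threshold argument lets you see how the set of accepted classes changes, which in_top_k does not expose.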

Computing exact moving average over multiple batches in tensorflow

During training, I would like to write the average loss over the last N mini-batches to SummaryWriter as a way of smoothing the very noisy batch loss. It's easy to compute this in python and print it, but I would like to add this to a summary so that I can see it in tensorboard. Here's an overly simplified example of what I'm doing now.
losses = []
for i in range(10000):
    _, loss = session.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        # How to produce a scalar_summary here?
        print(sum(losses)/len(losses))
        losses = []
I'm aware that I could use ExponentialMovingAverage with a decay of 1.0, but I would still need some way to reset this every N batches. Really, if all I care about is visualizing loss in tensorboard, the reset probably isn't necessary, but I'm still curious how one would go about aggregating across batches for other reasons (e.g. computing total accuracy over a test dataset that is too big to run in a single batch).
You can manually construct the Summary object, like this:
from tensorflow.core.framework import summary_pb2
def make_summary(name, val):
    return summary_pb2.Summary(value=[summary_pb2.Summary.Value(tag=name,
                                                                simple_value=val)])
summary_writer.add_summary(make_summary('myvalue', myvalue), step)
Passing data from Python to a graph function like tf.summary.scalar (formerly tf.scalar_summary) can be done using a placeholder and feed_dict.
average_pl = tf.placeholder(tf.float32)
average_summary = tf.summary.scalar("average_loss", average_pl)
writer = tf.summary.FileWriter("/tmp/mnist_logs", sess.graph)
losses = []
for i in range(10000):
    _, loss = sess.run([train_op, loss_op])
    losses.append(loss)
    if i % 100 == 0:
        # Feed the average computed in Python into the summary op
        feed = {average_pl: sum(losses)/len(losses)}
        summary_str = sess.run(average_summary, feed_dict=feed)
        writer.add_summary(summary_str, i)
        losses = []
I haven't tried it, and this was hastily copied from the visualizing data how-to, but I expect something like this would work.
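If you want a true moving average over the last N mini-batches rather than a block average that resets every N steps, a bounded deque does the bookkeeping on the Python side. This is a minimal sketch; WindowedMean is a hypothetical helper, and the value it returns is what you would feed into the placeholder or wrap in a manually constructed Summary.

```python
from collections import deque


class WindowedMean:
    """Exact mean over the most recent `window` values."""

    def __init__(self, window):
        if window <= 0:
            raise ValueError("window must be positive")
        # deque drops the oldest value automatically once it is full.
        self.values = deque(maxlen=window)

    def update(self, value):
        self.values.append(value)
        return sum(self.values) / len(self.values)


smoother = WindowedMean(window=3)
averages = [smoother.update(loss) for loss in [1.0, 2.0, 3.0, 4.0]]
```

The last entry is the mean of 2.0, 3.0 and 4.0, since the first value has already fallen out of the window.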