I'm not sure if I understand tf.metrics and tf.contrib.slim.metrics correctly.
Here is the general flow of the program:
# Setup of the neural network...
# Adding some metrics
dict_metrics[name] = compute_metric_and_update_op()
# Getting a list of all metrics and updates
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map(dict_metrics)
# Calling tf.slim evaluate
slim.evaluation.evaluation_loop(eval_op=list(names_to_updates.values()), ...)
Let's assume I want to compute the accuracy. I have two options:
a) Compute the accuracy over all pixels in all images in all batches
b) Compute the accuracy over all pixels in one image and take the average of all accuracies over all images in all batches.
For version a) this is what I would write:
name = "slim/accuracy_metric"
dict_metrics[name] = slim.metrics.streaming_accuracy(
    labels, predictions, weights=weights, name=name)
Which should be equivalent to:
name = "accuracy_metric"
accuracy, update_op = tf.metrics.accuracy(
    labels, predictions, weights=weights, name=name)
dict_metrics[name] = (accuracy, update_op)
Furthermore, it should be pointless, or even wrong, to add this line:
dict_metrics["stream/" + name] = slim.metrics.streaming_mean(accuracy)
Because the accuracy I get from tf.metrics.accuracy is already computed over all the batches via the update_op. Correct?
If I go with option b), I can achieve the effect like this:
accuracy = my_own_compute_accuracy(labels, predictions)
dict_metrics["stream/accuracy_own"] = \
    slim.metrics.streaming_mean(accuracy)
Where my_own_compute_accuracy() computes the symbolic accuracy for the labels and predictions tensors but does not return any update operation. In fact, does this version calculate the accuracy over a single image or a single batch? Basically, if I set the batch size to the size of the complete dataset, does this metric then match the output of slim.metrics.streaming_accuracy?
Lastly, if I add the same update operation twice, will it be called twice?
Thank you!
Yes, the slim streaming accuracy computes the mean of the per-batch accuracy over the entire dataset (if you only do one epoch).
For your own accuracy function, it depends on how you implement it.
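To make the difference between option a) and option b) concrete, here is a tiny NumPy sketch with made-up numbers; when images contribute different numbers of valid pixels, the two quantities generally differ:
import numpy as np

# Two images with different numbers of valid (non-masked) pixels -- made-up numbers.
correct_per_image = np.array([90, 40])   # correctly classified pixels per image
pixels_per_image = np.array([100, 50])   # valid pixels per image

# a) accuracy over all pixels in all images (one global ratio)
global_accuracy = correct_per_image.sum() / pixels_per_image.sum()    # 130 / 150 ~= 0.867

# b) mean of the per-image accuracies
per_image_accuracy = correct_per_image / pixels_per_image             # [0.9, 0.8]
mean_of_image_accuracies = per_image_accuracy.mean()                  # 0.85
If every image contributes the same number of valid pixels (equal weights), the two numbers coincide; otherwise they differ, which is exactly the question of whether option b) can reproduce option a).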
Related
As I understand it, one of the differences between a regular GAN and a WGAN is that we train the discriminator/critic with more examples in each epoch. If in the regular GAN we have one batch for both modules in each epoch, in the WGAN we will have 5 batches (or more) for the discriminator and one for the generator.
So basically we have another inner loop for the discriminator:
real_images_labels = np.ones((BATCH_SIZE, 1))
fake_images_labels = -real_images_labels
for epoch in range(epochs):
    for batch in range(NUM_BACHES):
        for critic_iter in range(n_critic):
            random_batches_idx = np.random.randint(0, NUM_BACHES)  # Choose random batch from dataset
            imgs_data = dataset_list[random_batches_idx]
            c_loss_real = critic.train_on_batch(imgs_data, real_images_labels)  # update the weights after 1 batch
            noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
            generated_images = generator(noise, training=True)
            c_loss_fake = critic.train_on_batch(generated_images, fake_images_labels)  # update the weights after 1 batch
        imgs_data = dataset_list[batch]
        noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
        gen_loss_batch = gen_loss_batch + gan.train_on_batch(noise, real_images_labels)
The training is taking a lot of time, about 3 minutes per epoch. My idea for decreasing the training time: instead of running a forward pass for each of the n_critic batches separately, I could increase the batch_size for the discriminator and run one forward pass with that bigger batch_size.
I am seeking feedback: does it sound reasonable?
(I didn't paste my entire code, this is just a part of it.)
Yes, it does sound reasonable: increasing batch_size during training typically decreases the training time, at the cost of using more memory and of lower accuracy (lower generalization ability).
Having said this, you should always do some trial and error with regard to batching, as extreme values may or may not increase the training time.
For further discussion you can refer to this question
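For reference, the proposed change might look roughly like the sketch below. It reuses the variable names from the question (dataset_list, NUM_BACHES, n_critic, critic, generator, noise_dim); note that a single update on an n_critic-times-larger batch averages the gradients of those batches, so it is not mathematically equivalent to n_critic sequential critic updates:
import numpy as np
import tensorflow as tf

# Draw n_critic random batches and train the critic once on the concatenation.
big_idx = np.random.randint(0, NUM_BACHES, size=n_critic)
big_real = np.concatenate([dataset_list[i] for i in big_idx], axis=0)
big_real_labels = np.ones((big_real.shape[0], 1))

noise = tf.random.normal([big_real.shape[0], noise_dim])
big_fake = generator(noise, training=True)

# One critic update on real data and one on fake data, each with the larger batch.
c_loss_real = critic.train_on_batch(big_real, big_real_labels)
c_loss_fake = critic.train_on_batch(big_fake, -big_real_labels)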
After reading GAN tutorials and code samples, I still don't understand how the generator is trained. Let's say we have a simple case:
- the generator input is noise and its output is a 10x10 grayscale image
- the discriminator input is a 10x10 image and its output is a single value from 0 to 1 (fake or real)
Training the discriminator is easy: take its output for a real image and expect 1; take its output for a fake and expect 0. We're working with the real output size here: a single value.
But training the generator is different: we take the fake output (1 value) and set the expected output for it to 1. But that sounds more like training the discriminator again. The output of the generator is a 10x10 image; how can we train it with only one single value? How does backpropagation work in this case?
To train the generator, you have to backpropagate through the entire combined model while freezing the weights of the discriminator, so that only the generator is updated.
For this, we have to compute d(g(z; θg); θd), where θg and θd are the weights of the generator and the discriminator. To update the generator, we compute the gradient with respect to θg only, ∂loss(d(g(z; θg); θd)) / ∂θg, and then update θg using normal gradient descent.
In Keras, this might look something like this (using the functional API):
genInput = Input(input_shape)
discriminator = ...
generator = ...
discriminator.trainable = True
discriminator.compile(...)
discriminator.trainable = False
combined = Model(genInput, discriminator(generator(genInput)))
combined.compile(...)
Setting trainable to False does not affect already compiled models; only models compiled afterwards are frozen. Thereby, the discriminator is trainable as a standalone model but frozen in the combined model.
Then, to train your GAN:
X_real = ...
noise = ...
X_gen = generator.predict(noise)
# This will only train the discriminator
loss_real = discriminator.train_on_batch(X_real, one_out)
loss_fake = discriminator.train_on_batch(X_gen, zero_out)
d_loss = 0.5 * np.add(loss_real, loss_fake)
noise = ...
# This will only train the generator.
g_loss = combined.train_on_batch(noise, one_out)
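For completeness, one_out and zero_out above are the target label arrays; they are not defined in the snippet, so here is a guess at what they would look like (an assumption, not part of the original code):
import numpy as np

batch_size = X_real.shape[0]          # the batch size used above
one_out = np.ones((batch_size, 1))    # targets for "real"
zero_out = np.zeros((batch_size, 1))  # targets for "fake"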
I think the best way to understand the Generator training procedure is to review the whole training loop.
For each epoch:
Update the Discriminator:
- forward the real-images mini-batch through the Discriminator;
- compute the Discriminator loss and calculate the gradients for the backward pass;
- generate a fake-images mini-batch via the Generator;
- forward the generated fake mini-batch through the Discriminator;
- compute the Discriminator loss and derive the gradients for the backward pass;
- add the gradients (real mini-batch gradients + fake mini-batch gradients);
- update the Discriminator (use Adam or SGD).
Update the Generator:
- flip the targets: fake images get labeled as real for the Generator. Note: this step amounts to cross-entropy minimization for the Generator and helps overcome the Generator's vanishing-gradient problem that arises if we stick with the original GAN min-max game;
- forward the fake-images mini-batch through the updated Discriminator;
- compute the Generator loss based on the updated Discriminator output, e.g. loss_function(D(fake), 1), where D(fake) is the Discriminator's estimated probability that the fake image is real and the target 1 is the Generator's "real" label for fake images;
- update the Generator (use Adam or SGD).
I hope this helps. As you can see from the training procedure, GAN players are somewhat "cooperative, in the sense that the discriminator estimates the ratio of data to model distribution densities and then freely shares this information with the generator. From this point of view, the discriminator is more like a teacher instructing the generator in how to improve than an adversary" (cited from I. Goodfellow's tutorial).
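To make the procedure above concrete, here is a minimal TF2/Keras sketch of a single training step. It assumes generator, discriminator and the optimizers d_optimizer and g_optimizer already exist and that the discriminator outputs logits; the real and fake discriminator losses are combined in one backward pass, which gives the same summed gradients as accumulating them separately:
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def train_step(real_batch, noise_dim=100):
    batch_size = real_batch.shape[0]
    noise = tf.random.normal([batch_size, noise_dim])

    # Discriminator update: real loss + fake loss in one backward pass
    # (same summed gradients as accumulating two separate backward passes).
    with tf.GradientTape() as d_tape:
        fake_batch = generator(noise, training=True)
        real_logits = discriminator(real_batch, training=True)
        fake_logits = discriminator(fake_batch, training=True)
        d_loss = (bce(tf.ones_like(real_logits), real_logits) +
                  bce(tf.zeros_like(fake_logits), fake_logits))
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    d_optimizer.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # Generator update with flipped targets: fakes are labeled 1 ("real").
    with tf.GradientTape() as g_tape:
        fake_logits = discriminator(generator(noise, training=True), training=True)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    g_optimizer.apply_gradients(zip(g_grads, generator.trainable_variables))

    return d_loss, g_loss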
Recently, I'm working on a project "predicting future trajectories of objects from their past trajectories by using LSTMs in Tensorflow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The size of a mini-batch is fixed during training. However, the number of past trajectories in a mini-batch can differ. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, 6 of the 10 entries in the mini-batch are padded with zeros.
When calculating the loss for back-propagation, I set the loss from those 6 entries to zero so that only the 4 real trajectories contribute to the back-propagation.
The problem I'm concerned about is that TensorFlow still seems to calculate gradients for the 6 padded entries even though their loss is zero. As a result, the training speed becomes slower as I increase the mini-batch size, even though I use the same training data.
I also used the tf.where function when calculating the loss. However, the training time does not decrease.
How can I reduce the training time?
Here I attached my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
    # For each element in a batch
    for b in range(batch_size):
        with tf.variable_scope("rnnlm") as scope:
            if (f > 0 or b > 0):
                scope.reuse_variables()
            # for each pedestrian in an element
            for p in range(MNP):
                # ground-truth position
                cur_gt_pose_dec = ...
                # loss mask ('1' for a valid pedestrian, '0' for padding)
                loss_mask_ped = ...
                # go through the RNN decoder
                output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(
                    cur_embed_frm_dec, zero_states_dec_list[b][p])
                # fully connected layer for the output
                cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
                # go through the embedding function for the next input
                prev_embed_frms_dec_list[b][p] = tf.reshape(
                    tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)),
                    shape=(1, rnn_size))
                # calculate the MSE loss
                mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
                # only a valid pedestrian's trajectory contributes to the loss
                self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
However, it looks like this is probably not your issue. It seems that for each item in your dataset you are defining new graph nodes. This is not how TensorFlow is supposed to be used: you should have only one graph, built beforehand, that performs some fixed function regardless of the batch size. You should definitely not be defining new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
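As an illustration of that second point, here is a rough sketch (tensor names and shapes are assumptions, not the question's actual variables) of how the masked MSE could be written once over the whole batch instead of per element in Python loops:
import tensorflow as tf

# Assumed shapes: batch_size x pred_length x MNP pedestrians x 2 coordinates.
pred_poses = tf.placeholder(tf.float32, [None, None, None, 2])  # batched decoder predictions
gt_poses = tf.placeholder(tf.float32, [None, None, None, 2])    # ground-truth positions
loss_mask = tf.placeholder(tf.float32, [None, None, None])      # 1.0 = valid, 0.0 = padding

squared_error = tf.reduce_sum(tf.square(pred_poses - gt_poses), axis=-1)
masked_error = squared_error * loss_mask
# Normalize by the number of valid entries so padded pedestrians neither
# contribute gradients nor dilute the loss.
loss = tf.reduce_sum(masked_error) / tf.maximum(tf.reduce_sum(loss_mask), 1.0)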
This is related to a previous question: How to partition a single batch into many invocations to save memory, and also to How to train a big model with relatively large batch size on a single GPU using Tensorflow?; but I still couldn't find the exact answer there. For example, the answer to another related question, tensorflow - run optimizer op on a large batch, doesn't work for me (by the way, it wasn't accepted and there are no more comments there).
I want to try to simulate larger batch size but using only one GPU.
So, I need to compute the gradients for every smaller batch, aggregate/average them across several such smaller batches, and only then apply.
(Basically, it's like synchronized distributed SGD, but on a single device/GPU, performed serially. Of course, the acceleration advantage of distributed SGD is lost, but a larger batch size itself may enable convergence to higher accuracy and allow a larger step size, as indicated by a few recent papers.)
To keep the memory requirement low, I would do standard SGD with small batches, accumulate the gradients after every iteration, and only then call optimizer.apply_gradients() (where optimizer is one of the implemented optimizers).
So everything looks simple, but when I go to implement it, it is actually not so trivial.
For example, I would like to use one Graph, compute the gradients for each iteration, and then, once multiple batches have been processed, sum the gradients and pass them to my model. But the list itself can't be fed into the feed_dict parameter of sess.run. Also, passing the gradients directly doesn't exactly work; I get TypeError: unhashable type: 'numpy.ndarray' (I think the reason is that I can't use a numpy.ndarray as a feed_dict key, only a TensorFlow tensor).
I could define a placeholder for the gradients, but for that I would need to build the model first (to specify the trainable variables, etc.).
All in all, please tell me there is a simpler way to implement this.
There is no simpler way than what you have already been told. That way may seem complicated at first, but it actually is really simple. You just have to use the low-level API to manually calculate the gradients for each batch, average over them, and then manually feed the averaged gradients to the optimizer to apply them.
I'll try to provide some stripped down code of how to do this. I'll use dots as placeholders for actual code which would depend on the problem. What you would usually do would be something like this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# define operation to apply the gradients
minimize = optimizer.minimize(loss)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())

    for step in range(1, MAX_STEPS + 1):
        data = ...
        # use a new name so the `loss` tensor is not overwritten
        loss_value = session.run([minimize, loss],
                                 feed_dict={input: data})[1]
What you want to do instead now, to average over multiple batches to preserve memory, would be this:
import tensorflow as tf
[...]
input = tf.placeholder(...)
[...]
loss = ...
[...]
# initialize the optimizer
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
# grab all trainable variables
trainable_variables = tf.trainable_variables()
# define variables to save the gradients in each batch
accumulated_gradients = [tf.Variable(tf.zeros_like(tv.initialized_value()),
                                     trainable=False)
                         for tv in trainable_variables]
# define operation to reset the accumulated gradients to zero
reset_gradients = [gradient.assign(tf.zeros_like(gradient))
                   for gradient in accumulated_gradients]
# compute the gradients
gradients = optimizer.compute_gradients(loss, trainable_variables)
# Note: Gradients is a list of tuples containing the gradient and the
# corresponding variable so gradient[0] is the actual gradient. Also divide
# the gradients by BATCHES_PER_STEP so the learning rate still refers to
# steps not batches.
# define operation to evaluate a batch and accumulate the gradients
evaluate_batch = [
    accumulated_gradient.assign_add(gradient[0] / BATCHES_PER_STEP)
    for accumulated_gradient, gradient in zip(accumulated_gradients, gradients)]
# define operation to apply the gradients
apply_gradients = optimizer.apply_gradients([
    (accumulated_gradient, gradient[1])
    for accumulated_gradient, gradient in zip(accumulated_gradients, gradients)])
# define variable and operations to track the average batch loss
average_loss = tf.Variable(0., trainable=False)
update_loss = average_loss.assign_add(loss/BATCHES_PER_STEP)
reset_loss = average_loss.assign(0.)
[...]
if __name__ == '__main__':
    session = tf.Session(config=CONFIG)
    session.run(tf.global_variables_initializer())

    # batch_data is assumed to hold the small batches that make up one step
    data = [batch_data[i] for i in range(BATCHES_PER_STEP)]
    for small_batch in data:
        session.run([evaluate_batch, update_loss],
                    feed_dict={input: small_batch})
    # apply accumulated gradients
    session.run(apply_gradients)
    # get loss
    loss = session.run(average_loss)
    # reset variables for next step
    session.run([reset_gradients, reset_loss])
This should be runnable if you fill in the gaps. However, I might have made a mistake while stripping it down and pasting it here. For a runnable example, you can take a look at a project I am currently working on myself.
I also want to make clear that this is not the same as evaluating the loss on all of the batch data at once, since you average over the gradients. This is especially important when your loss does not work well with low statistics. Take a chi-square of histograms, for example: calculating the average of gradients computed on histograms with low bin counts won't be as good as calculating the gradient on a single histogram with all the bins filled up at once.
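A toy illustration of that distinction, using a loss that is a nonlinear function of a pooled batch statistic (purely illustrative, unrelated to the code above): for loss = (w * mean(x))^2, the gradient on the full batch differs from the average of the sub-batch gradients:
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
w = 0.5

def grad(batch):
    # d/dw of (w * mean(batch))**2 = 2 * w * mean(batch)**2
    return 2 * w * np.mean(batch) ** 2

full_batch_gradient = grad(x)                                         # 2 * 0.5 * 2.5**2 = 6.25
average_of_sub_batch_gradients = np.mean([grad(x[:2]), grad(x[2:])])  # (2.25 + 12.25) / 2 = 7.25
For a loss that is a plain mean over per-example losses, the two coincide (with equal sub-batch sizes), which is why gradient averaging usually works fine for standard classification or regression losses.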
You would need to supply the gradients as the values that get fed to apply_gradients. They can be placeholders, but it is probably easier to use the usual compute_gradients/apply_gradients combination:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# On training
# Compute some gradients
gradient_values = session.run(gradient_tensors, feed_dict={...})
# gradient_values is a sequence of numpy arrays with gradients
# After averaging multiple evaluations of gradient_values apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
If you want to compute the averages of the gradients within TensorFlow too, that requires a bit of extra code specifically for that, maybe something like this:
# Some loss measure
loss = ...
optimizer = ...
gradients = optimizer.compute_gradients(loss)
# gradients is a list of pairs
gradient_tensors, _ = zip(*gradients)
# Apply gradients as usual
train_op = optimizer.apply_gradients(gradients)
# Additional operations for gradient averaging
gradient_placeholders = [tf.placeholder(t.dtype, [None] + t.shape.as_list())
                         for t in gradient_tensors]
gradient_averages = [tf.reduce_mean(p, axis=0) for p in gradient_placeholders]
# On training
gradient_values = None
# Compute some gradients
for ...:  # Repeat for each small batch
    gradient_values_current = session.run(gradient_tensors, feed_dict={...})
    if gradient_values is None:
        gradient_values = [[g] for g in gradient_values_current]
    else:
        for g_list, g in zip(gradient_values, gradient_values_current):
            g_list.append(g)
# Stack gradients
gradient_values = [np.stack(g_list) for g_list in gradient_values]
# Compute averages
gradient_values_average = session.run(
    gradient_averages, feed_dict=dict(zip(gradient_placeholders, gradient_values)))
# After averaging multiple gradients apply them
session.run(train_op, feed_dict=dict(zip(gradient_tensors, gradient_values_average)))
In TF's official tutorial code 'cifar10', there is an evaluation snippet:
def evaluate():
    with tf.Graph().as_default() as g:
        # Get images and labels for CIFAR-10.
        eval_data = FLAGS.eval_data == 'test'
        images, labels = cifar10.inputs(eval_data=eval_data)

        # Build a Graph that computes the logits predictions from the
        # inference model.
        logits = cifar10.inference(images)

        # Calculate predictions.
        top_k_op = tf.nn.in_top_k(logits, labels, 1)

        # Restore the moving average version of the learned variables for eval.
        variable_averages = tf.train.ExponentialMovingAverage(
            cifar10.MOVING_AVERAGE_DECAY)
        variables_to_restore = variable_averages.variables_to_restore()
        saver = tf.train.Saver(variables_to_restore)

        # Build the summary operation based on the TF collection of Summaries.
        summary_op = tf.summary.merge_all()
        summary_writer = tf.summary.FileWriter(FLAGS.eval_dir, g)

        while True:
            eval_once(saver, summary_writer, top_k_op, summary_op)
            if FLAGS.run_once:
                break
            time.sleep(FLAGS.eval_interval_secs)
At runtime, it evaluates one batch of test samples and prints 'precision' to the console every eval_interval_secs. My questions are:
1. Each time eval_once() is executed, one batch of samples (128) is dequeued from the data queue, but why doesn't the evaluation stop after enough batches, i.e. 10000/128 + 1 = 79 batches? I thought it should stop after 79 batches.
2. Are the batches from the first 79 samplings mutually exclusive? I'd assume so but want to double-check.
3. If each batch is indeed dequeued from the data queue, what are the samples after 79 rounds of sampling? Some random sampling from the entire (duplicated) data queue again?
4. Since in_top_k() takes in unnormalized logit values and outputs a vector of booleans, this masks the internal conversion of softmax() + thresholding. Is there a TF op for doing these computations explicitly? Ideally, it would be useful to be able to tune the threshold and see different classification results.
Please help.
Thanks!
You can see the following line in the "inputs" definition of cifar10_input.py:
filename_queue = tf.train.string_input_producer(filenames)
More about tf.train.string_input_producer:
string_input_producer(
    string_tensor,
    num_epochs=None,
    shuffle=True,
    seed=None,
    capacity=32,
    shared_name=None,
    name=None,
    cancel_op=None
)
num_epochs: produces each string from string_tensor num_epochs times before generating an OutOfRange error. If not specified, string_input_producer can cycle through the strings in string_tensor an unlimited number of times.
In our case, num_epochs is not specified. That's why the evaluation does not stop after a few batches: the queue can cycle through the data an unlimited number of times.
By default, the shuffle option is set to True in tf.train.string_input_producer. So it shuffles the 10K filenames first and then cycles through that shuffled list again and again.
Therefore, the batches within one pass are mutually exclusive. You can print the filenames to see this.
As explained in 1, they are repeated samples (not any new random data).
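If, instead, you want the evaluation to stop after exactly one pass over the 10K test images, a common adjustment (a sketch, not the tutorial's code) is to limit the number of epochs and disable shuffling in the input queue:
filename_queue = tf.train.string_input_producer(
    filenames, num_epochs=1, shuffle=False)
# Note: with num_epochs set, the epoch counter is a local variable, so you also
# need sess.run(tf.local_variables_initializer()); the queue then raises
# tf.errors.OutOfRangeError once the files are exhausted.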
You can avoid using tf.nn.in_top_k. Use tf.nn.softmax and tf.greater_equal to obtain a boolean tensor indicating where the softmax values are above a specific threshold.
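For example (the threshold value is arbitrary, and logits/labels are the tensors from the evaluation snippet above):
probabilities = tf.nn.softmax(logits)                    # [batch_size, num_classes]
top_probability = tf.reduce_max(probabilities, axis=1)   # confidence of the predicted class
predicted_class = tf.argmax(probabilities, axis=1)

threshold = 0.5  # tune as needed
# correct prediction AND softmax confidence above the threshold
confident_top_1 = tf.logical_and(
    tf.equal(predicted_class, tf.cast(labels, tf.int64)),
    tf.greater_equal(top_probability, threshold))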
I hope this helps. Please comment if there is any misunderstanding.