Training the discriminator with more examples - tensorflow

As I understand it, one of the differences between a regular GAN and a WGAN is that we train the discriminator/critic with more examples in each epoch. If in the regular GAN we have one batch for both modules in each epoch, in WGAN we will have 5 batches (or more) for the discriminator and one for the generator.
So basically we have another inner loop for the discriminator:
real_images_labels = np.ones((BATCH_SIZE, 1))
fake_images_labels = -real_images_labels
for epoch in range(epochs):
    for batch in range(NUM_BACHES):
        for critic_iter in range(n_critic):
            random_batches_idx = np.random.randint(0, NUM_BACHES)  # Choose a random batch from the dataset
            imgs_data = dataset_list[random_batches_idx]
            c_loss_real = critic.train_on_batch(imgs_data, real_images_labels)  # update the weights after 1 batch
            noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
            generated_images = generator(noise, training=True)
            c_loss_fake = critic.train_on_batch(generated_images, fake_images_labels)  # update the weights after 1 batch
        imgs_data = dataset_list[batch]
        noise = tf.random.normal([imgs_data.shape[0], noise_dim])  # Generate noise data
        gen_loss_batch = gen_loss_batch + gan.train_on_batch(noise, real_images_labels)
The training is taking a lot of time, about 3 minutes per epoch. My idea for decreasing the training time is: instead of running the forward pass n_critic times per batch, I could increase the batch_size for the discriminator and run the forward pass once with the bigger batch.
I am seeking feedback: does it sound reasonable?
(I didn't paste my entire code, this is just part of it.)

Yes, it does sound reasonable. Increasing the batch_size during training typically decreases the training time, at the cost of using more memory and lower accuracy (lower generalization ability).
Having said this, you should always do trial and error with regards to batching, as extreme values may or may not increase the training time.
For further discussion you can refer to this question.
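As a rough illustration of the idea (not the asker's exact code, and reusing the variable names from the snippet above), the n_critic critic updates could be replaced by one larger batch like this. Note that a single update on an n_critic-times-larger batch averages the gradient once rather than taking n_critic sequential steps, so it is not mathematically equivalent to the WGAN inner loop:

big_idx = np.random.randint(0, NUM_BACHES, size=n_critic)              # n_critic random batch indices
big_real = np.concatenate([dataset_list[i] for i in big_idx], axis=0)  # one big real batch
big_real_labels = np.ones((big_real.shape[0], 1))

noise = tf.random.normal([big_real.shape[0], noise_dim])
big_fake = generator(noise, training=True)

c_loss_real = critic.train_on_batch(big_real, big_real_labels)         # one update on the big real batch
c_loss_fake = critic.train_on_batch(big_fake, -big_real_labels)        # one update on the big fake batch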

Related

What does steps mean in the train method of tf.estimator.Estimator?

I'm completely confused about the meaning of epochs and steps. I also read the question What is the difference between steps and epochs in TensorFlow?, but I'm not sure about the answer. Consider this part of the code:
EVAL_EVERY_N_STEPS = 100
MAX_STEPS = 10000

nn = tf.estimator.Estimator(
    model_fn=model_fn,
    model_dir=args.model_path,
    params={"learning_rate": 0.001},
    config=tf.estimator.RunConfig())

for _ in range(MAX_STEPS // EVAL_EVERY_N_STEPS):
    print(_)
    nn.train(input_fn=train_input_fn,
             hooks=[train_qinit_hook, step_cnt_hook],
             steps=EVAL_EVERY_N_STEPS)
    if args.run_validation:
        results_val = nn.evaluate(input_fn=val_input_fn,
                                  hooks=[val_qinit_hook,
                                         val_summary_hook],
                                  steps=EVAL_STEPS)
        print('Step = {}; val loss = {:.5f};'.format(
            results_val['global_step'],
            results_val['loss']))
Also, the number of training samples is 400. I consider MAX_STEPS // EVAL_EVERY_N_STEPS to be the number of epochs (or iterations); indeed, the number of epochs is 100. What does steps mean in nn.train?
In Deep Learning:
an epoch means one pass over the entire training set.
a step or iteration corresponds to one forward pass and one backward pass.
If your dataset is not divided and passed as is to your algorithm, each step corresponds to one epoch, but usually, a training set is divided into N mini-batches. Then, each step goes through one batch and you need N steps to complete a full epoch.
Here, if batch_size == 4 then 100 steps are indeed equal to one epoch.
epochs = batch_size * steps // n_training_samples
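Plugging in the numbers from the question (400 training samples, batch_size assumed to be 4, MAX_STEPS = 10000) confirms the 100-epoch figure:

n_training_samples = 400
batch_size = 4
steps = 10000  # MAX_STEPS
epochs = batch_size * steps // n_training_samples
print(epochs)  # 100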

How to make a selective back-propagation in a mini-batch in Tensorflow?

Recently, I've been working on a project "predicting future trajectories of objects from their past trajectories by using LSTMs in Tensorflow."
(Here, a trajectory means a sequence of 2D positions.)
Input to the LSTM is, of course, 'past trajectories' and output is 'future trajectories'.
The size of the mini-batch is fixed during training. However, the number of past trajectories in a mini-batch can differ. For example, let the mini-batch size be 10. If I have only 4 past trajectories for the current training iteration, 6 out of the 10 slots in the mini-batch are padded with zeros.
When calculating the loss for the back-propagation, I set the loss from those 6 to zero so that only the 4 contribute to the back-propagation.
The problem I'm concerned about is that it seems TensorFlow still calculates gradients for the 6 even if their loss is zero. As a result, the training speed becomes slower as I increase the mini-batch size, even though I use the same training data.
I also used the tf.where function when calculating the loss. However, the training time did not decrease.
How can I reduce the training time?
Here I attached my pseudo code for training.
# For each frame in a sequence
for f in range(pred_length):
    # For each element in a batch
    for b in range(batch_size):
        with tf.variable_scope("rnnlm") as scope:
            if (f > 0 or b > 0):
                scope.reuse_variables()
            # for each pedestrian in an element
            for p in range(MNP):
                # ground-truth position
                cur_gt_pose_dec = ...
                # loss mask
                loss_mask_ped = ...  # '1' or '0'
                # go through RNN decoder
                output_states_dec_list[b][p], zero_states_dec_list[b][p] = cell_dec(cur_embed_frm_dec,
                                                                                    zero_states_dec_list[b][p])
                # fully connected layer for output
                cur_pred_pose_dec = tf.nn.xw_plus_b(output_states_dec_list[b][p], output_wd, output_bd)
                # go through embedding function for the next input
                prev_embed_frms_dec_list[b][p] = tf.reshape(
                    tf.nn.relu(tf.nn.xw_plus_b(cur_pred_pose_dec, embedding_wd, embedding_bd)),
                    shape=(1, rnn_size))
                # calculate MSE loss
                mse_loss = tf.reduce_sum(tf.pow(tf.subtract(cur_pred_pose_dec, cur_gt_pose_dec), 2.0))
                # only a valid pedestrian's trajectory contributes to the loss
                self.loss += tf.multiply(mse_loss, loss_mask_ped)
I think you're looking for the function tf.stop_gradient. Using this, you could do something like tf.where(loss_mask, tensor, tf.stop_gradient(tensor)) to achieve the desired result, assuming that the dimensions are correct.
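A minimal TF1-style sketch of that pattern, with hypothetical shapes; per_elem_loss and loss_mask are stand-ins for the per-pedestrian loss and validity mask:

import tensorflow as tf

per_elem_loss = tf.placeholder(tf.float32, [10])  # hypothetical per-element losses
loss_mask = tf.placeholder(tf.bool, [10])         # True for real elements, False for padding

# Gradients flow only through the masked-in elements; the padded elements take
# the (numerically identical) value from the stop_gradient branch.
masked_loss = tf.where(loss_mask, per_elem_loss, tf.stop_gradient(per_elem_loss))
loss = tf.reduce_sum(masked_loss)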
However, it looks like this is probably not your issue. It seems as though, for each item in your dataset, you are defining new graph nodes. This is not how TensorFlow is meant to be used: you should have only one graph, built beforehand, that performs a fixed computation regardless of the batch size. You should definitely not be defining new nodes for every element in the batch, since that cannot efficiently take advantage of parallelism.
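As a rough sketch of the "one fixed graph" idea (TF1-style, with hypothetical shapes and names, not the original model), the trajectory loss can be computed for the whole batch at once and masked, so no per-element Python loops or extra graph nodes are needed:

import tensorflow as tf

batch_size, pred_length = 10, 12

pred_pose = tf.placeholder(tf.float32, [batch_size, pred_length, 2])  # predicted 2D positions
gt_pose = tf.placeholder(tf.float32, [batch_size, pred_length, 2])    # ground-truth 2D positions
loss_mask = tf.placeholder(tf.float32, [batch_size])                  # 1.0 for real trajectories, 0.0 for padding

# Squared error per element, summed over time steps and coordinates.
per_elem_loss = tf.reduce_sum(tf.square(pred_pose - gt_pose), axis=[1, 2])

# Masked mean: padded elements contribute nothing, and the graph is identical
# for every batch, so no new nodes are created per training iteration.
loss = tf.reduce_sum(per_elem_loss * loss_mask) / tf.maximum(tf.reduce_sum(loss_mask), 1.0)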

In tensorflow estimator class, what does it mean to train one step?

Specifically, within one step, how does it train the model? What is the stopping condition for the gradient descent and back-propagation?
Docs here: https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator#train
e.g.
mnist_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": X_train},
    y=y_train,
    batch_size=50,
    num_epochs=None,
    shuffle=True)

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=100,
    hooks=[logging_hook])
I understand that training one step means that we feed the neural network model with batch_size data points once. My question is: within this one step, how many times does it perform gradient descent? Does it do back-propagation and gradient descent just once, or does it keep performing gradient descent until the model weights reach an optimum for this batch of data?
In addition to David Parks' answer, using batches for performing gradient descent is referred to as stochastic (mini-batch) gradient descent. Instead of updating the weights after each training sample, you average the gradients over the batch and use this averaged gradient to update your weights.
For example, if you have 1,000 training samples and use batches of 200, you calculate the average gradient over 200 samples and update your weights with it. That means you only perform 5 updates per pass over the data instead of updating your weights 1,000 times. On sufficiently big datasets, you will experience a much faster training process.
Michael Nielsen has a really nice way to explain this concept in his book.
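To make the arithmetic concrete, here is a tiny NumPy sketch (toy linear model, hypothetical learning rate) of the scheme described above: 1,000 samples with a batch size of 200 give exactly 5 weight updates per pass over the data:

import numpy as np

n_samples, batch_size, lr = 1000, 200, 0.1
rng = np.random.RandomState(0)

X = rng.randn(n_samples, 3)
y = X @ np.array([1.0, -2.0, 0.5])  # toy linear targets
w = np.zeros(3)                     # model weights

for start in range(0, n_samples, batch_size):      # 5 iterations = 5 updates
    xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size   # gradient averaged over the batch
    w -= lr * grad                                 # one weight update per batch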
1 step = 1 gradient update. And each gradient update step requires one forward pass and one backward pass.
The stopping condition is generally left up to you and is arguably more art than science. Commonly you will plot (TensorBoard is handy here) your cost, training accuracy, and periodically your validation set accuracy. The point where validation accuracy peaks (equivalently, where validation error bottoms out) is generally a good point to stop. Depending on your dataset, validation accuracy may drop and at some point increase again, or it may simply flatten out, at which point the stopping condition often correlates with the developer's degree of impatience.
Here's a nice article on stopping conditions, a google search will turn up plenty more.
https://stats.stackexchange.com/questions/231061/how-to-use-early-stopping-properly-for-training-deep-neural-network
Another common approach to stopping is to drop the learning rate every time you find that validation accuracy has not changed for some "reasonable" number of steps. When you've effectively hit a learning rate of 0, you call it quits.
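A minimal sketch of that drop-the-learning-rate-on-plateau rule; the thresholds, patience, and the train_one_step/evaluate stubs are all hypothetical stand-ins for your real training and validation code:

import random

def train_one_step(lr):  # stand-in for one gradient update at learning rate lr
    pass

def evaluate():          # stand-in returning a validation accuracy
    return random.random()

lr, best_val_acc, stale_steps = 1e-3, 0.0, 0
patience, max_steps = 50, 10000

for step in range(max_steps):
    train_one_step(lr)
    val_acc = evaluate()
    if val_acc > best_val_acc:
        best_val_acc, stale_steps = val_acc, 0
    else:
        stale_steps += 1
    if stale_steps >= patience:  # no improvement for a "reasonable" number of steps
        lr *= 0.1                # drop the learning rate
        stale_steps = 0
    if lr < 1e-8:                # effectively zero learning rate -> call it quits
        break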
The input function emits batches (when num_epochs=None, num_batches is infinite):
num_batches = num_epochs * (num_samples / batch_size)
One step processes 1 batch; if steps > num_batches, the training will stop after num_batches.
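For example (hypothetical sample count, using the batch_size=50 from the snippet above):

num_samples, batch_size = 1000, 50

# With num_epochs=1, the input function is exhausted after 20 batches,
# so train(..., steps=100) would stop after 20 steps.
num_batches = 1 * (num_samples // batch_size)  # 20

# With num_epochs=None (as in the snippet above), num_batches is unbounded,
# so train(..., steps=100) performs exactly 100 gradient updates.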

Understanding tf.metrics and slim's streaming metrics

I'm not sure if I understand tf.metrics and tf.contrib.slim.metrics correctly.
Here is the general flow of the program:
# Setup of the neural network...
# Adding some metrics
dict_metrics[name] = compute_metric_and_update_op()
# Getting a list of all metrics and updates
names_to_values, names_to_updates = slim.metrics.aggregate_metric_map(dict_metrics)
# Calling tf.slim evaluate
slim.evaluation.evaluation_loop(eval_op=list(names_to_updates.values()), ...)
Let's assume I want to compute the accuracy. I have two options:
a) Compute the accuracy over all pixels in all images in all batches
b) Compute the accuracy over all pixels in one image and take the average of all accuracies over all images in all batches.
For version a) this is what I would write:
name = "slim/accuracy_metric"
dict_metrics[name] = slim.metrics.streaming_accuracy(
    labels, predictions, weights=weights, name=name)
Which should be equivalent to:
name = "accuracy_metric"
accuracy, update_op = tf.metrics.accuracy(
    labels, predictions, weights=weights, name=name)
dict_metrics[name] = (accuracy, update_op)
Furthermore, it should be pointless or even wrong to add this line
dict_metrics["stream/" + name] = slim.metrics.streaming_mean(accuracy)
Because the accuracy I get from tf.metrics.accuracy is already computed over all the batches via the update_op. Correct?
If I go with option b), I can achieve the effect like this:
accuracy = my_own_compute_accuracy(labels, predictions)
dict_metrics["stream/accuracy_own"] = \
slim.metrics.streaming_mean(accuracy)
Where my_own_compute_accuracy() computes the symbolic accuracy for the labels and predictions tensors but does not return any update operation. In fact, does this version calculate the accuracy over a single image or a single batch? Basically, if I set the batch size to the size of the complete dataset, does this metric then match the output of slim.metrics.streaming_accuracy?
Lastly, if I add the same update operation twice, will it be called twice?
Thank you!
Yes, the slim streaming accuracy computes the mean of the per-batch accuracy over the entire dataset (if you only do one epoch).
For your own accuracy function, it depends on how you implement it.
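As a hedged TF1-style sketch of the two options (hypothetical shapes; the real labels and predictions come from your network): option a) accumulates correct and total pixel counts across all batches, while option b) streams the mean of a per-image accuracy:

import tensorflow as tf

labels = tf.placeholder(tf.int64, [None, 224, 224])       # hypothetical per-pixel labels
predictions = tf.placeholder(tf.int64, [None, 224, 224])  # hypothetical per-pixel predictions

# Option a): pixel-level streaming accuracy over all batches.
acc_a, update_a = tf.metrics.accuracy(labels, predictions)

# Option b): accuracy per image, then a streaming mean over all images.
correct = tf.cast(tf.equal(labels, predictions), tf.float32)
per_image_accuracy = tf.reduce_mean(correct, axis=[1, 2])
acc_b, update_b = tf.metrics.mean(per_image_accuracy)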

Breaking down Tensorflow performance with timeline and benchmarking

Using TF 0.12.1, we are trying to understand how the performance of TensorFlow breaks down. In particular, we are looking at the Inception-v3 model and how long the forward-pass step takes.
The first step we looked at was to run a benchmark on just the inference step. To avoid queueing time, we set the training example to a constant tensor and run it through the Inception model. The train method in the code is below:
def train(dataset):
  """Train on dataset for a number of steps."""
  with tf.Graph().as_default(), tf.device('/cpu:0'):
    # Create a variable to count the number of train() calls. This equals the
    # number of batches processed * FLAGS.num_gpus.
    global_step = tf.get_variable(
        'global_step', [],
        initializer=tf.constant_initializer(0), trainable=False)

    # Calculate the learning rate schedule.
    num_batches_per_epoch = (dataset.num_examples_per_epoch() /
                             FLAGS.batch_size)
    decay_steps = int(num_batches_per_epoch * FLAGS.num_epochs_per_decay)

    # Decay the learning rate exponentially based on the number of steps.
    lr = tf.train.exponential_decay(FLAGS.initial_learning_rate,
                                    global_step,
                                    decay_steps,
                                    FLAGS.learning_rate_decay_factor,
                                    staircase=True)

    # Create an optimizer that performs gradient descent.
    opt = tf.train.RMSPropOptimizer(lr, RMSPROP_DECAY,
                                    momentum=RMSPROP_MOMENTUM,
                                    epsilon=RMSPROP_EPSILON)

    # Get images and labels for ImageNet and split the batch across GPUs.
    assert FLAGS.batch_size % FLAGS.num_gpus == 0, (
        'Batch size must be divisible by number of GPUs')
    split_batch_size = int(FLAGS.batch_size / FLAGS.num_gpus)

    num_classes = dataset.num_classes() + 1

    # Calculate the gradients for each model tower.
    tower_grads = []
    reuse_variables = None
    for i in xrange(FLAGS.num_gpus):
      with tf.device('/gpu:%d' % i):
        with tf.name_scope('%s_%d' % (inception.TOWER_NAME, i)) as scope:
          # Force all Variables to reside on the CPU.
          with slim.arg_scope([slim.variables.variable], device='/cpu:0'):
            # Calculate the loss for one tower of the ImageNet model. This
            # function constructs the entire ImageNet model but shares the
            # variables across all towers.
            image_shape = (FLAGS.batch_size, FLAGS.image_size, FLAGS.image_size, 3)
            labels_shape = (FLAGS.batch_size)
            images = tf.zeros(image_shape, dtype=tf.float32)
            labels = tf.zeros(labels_shape, dtype=tf.int32)
            logits = _tower_loss(images, labels, num_classes,
                                 scope, reuse_variables)

          # Reuse variables for the next tower.
          reuse_variables = True

    # Build an initialization operation to run below.
    init = tf.initialize_all_variables()

    # Start running operations on the Graph. allow_soft_placement must be set to
    # True to build towers on GPU, as some of the ops do not have GPU
    # implementations.
    sess = tf.Session(config=tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=FLAGS.log_device_placement))
    sess.run(init)

    # Start the queue runners.
    tf.train.start_queue_runners(sess=sess)

    for step in xrange(FLAGS.max_steps):
      start_time = time.time()
      loss_value = sess.run(logits)
      duration = time.time() - start_time

      examples_per_sec = FLAGS.batch_size / float(duration)
      format_str = ('%s: step %d, loss =(%.1f examples/sec; %.3f '
                    'sec/batch)')
      print(format_str % (datetime.now(), step,
                          examples_per_sec, duration))
For 8 GPUs, a batch size of 32, and 1 parameter server, we observe 0.44 seconds per run of the logits operation, which does the forward pass. However, when we run the timeline tool, we observe a much smaller inference time (see figure below). For the GPU runtime, observe that there is an initial burst followed by a break, followed by a longer GPU burst. We assume the initial burst is the forward pass while the second burst is the backpropagation.
If the initial burst really is the forward pass time, it is substantially less than 0.44 seconds. Can anyone explain the discrepancy between these results? Is it a mistake with the benchmarking app or is the timeline tool not capturing the full picture? Additionally, there are a couple of GPU operations before the first large burst that we cannot really explain. Any insight into this would be very much appreciated!
TensorFlow has undergone a number of significant performance improvements since TF 0.12.1. If you are interested in solid performance numbers, please use the latest version of TensorFlow, or version 1.2 when it is released.
If you would like to work from a high-performance model as a starting point, I strongly recommend working from https://github.com/tensorflow/benchmarks which include an Inception-v3 model.
As for trying to understand the detailed performance of a single step, I recommend instrumenting the C++ TensorFlow runtime. (The overhead from within Python can be significant, and could introduce uncertainty in your measurements.)
Additionally, it's important to run the experiment for a number of iterations to allow the system to "warm up" and fully initialize.
One thing to note: if you are trying to tune your model, be sure to avoid setting allow_soft_placement=True. For now, it's better to ensure that all operations you expect are truly placed on the GPUs. You can confirm by looking at the log output controlled by the log_device_placement parameter.
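For reference, a common way to capture the per-op timeline mentioned in the question (TF 0.12/1.x API; sess and logits are assumed to come from the training script above, and the first runs are discarded as warm-up) looks roughly like this:

from tensorflow.python.client import timeline

# Warm-up runs so one-time initialization does not pollute the measurement.
for _ in range(10):
    sess.run(logits)

# Trace a single step and dump a Chrome-trace JSON viewable in chrome://tracing.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(logits, options=run_options, run_metadata=run_metadata)

tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format())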