Gluon code not much faster on GPU than on CPU - mxnet

We're training a network for a recommender system, on triplets. The core code for the fit method is as follows:
for e in range(epochs):
    start = time.time()
    cumulative_loss = 0
    for i, batch in enumerate(train_iterator):
        # Forward + backward.
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])
        # Calculate gradients.
        loss.backward()
        # Update parameters of the network.
        trainer_fn.step(batch_size)
        # Calculate training metrics. Sum losses of every batch.
        cumulative_loss += nd.mean(loss).asscalar()
    train_iterator.reset()
where train_iterator is a custom iterator class that inherits from mx.io.DataIter and returns the data (the triplets) already in the appropriate context, as:
data = [mx.nd.array(data[:, :-1], self.ctx, dtype=np.int)]
labels = [mx.nd.array(data[:, -1], self.ctx)]
return mx.io.DataBatch(data, labels)
self.model.initialize(ctx=mx.gpu(0)) was also called before running the fit method, and loss_fn = gluon.loss.L1Loss().
The trouble is that nvidia-smi reports that the process is correctly allocated on the GPU. However, running fit on the GPU is not much faster than running it on the CPU. In addition, increasing batch_size from 50,000 to 500,000 increases the time per batch by a factor of 10 (which I was not expecting, given GPU parallelization).
Specifically, for a 50k batch:
* output = self.model(batch.data[0]) takes 0.03 seconds on GPU, and 0.08 on CPU.
* loss.backward() takes 0.11 seconds on GPU, and 0.39 on CPU.
Both were assessed with nd.waitall() to avoid asynchronous calls leading to incorrect measurements.
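For reference, a sketch of how such a per-call timing can be taken (reusing self.model and batch from the code above): nd.waitall() forces MXNet's asynchronous engine to finish all pending work before the clock is read.
tic = time.time()
output = self.model(batch.data[0])
nd.waitall()  # block until all queued GPU work has completed
print("forward: %.3f s" % (time.time() - tic))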
In addition, very similar code running on plain MXNet took less than 0.03 seconds for the corresponding part, which leads to a full epoch taking from slightly above one minute with MXNet up to 15 minutes with Gluon.
Any ideas on what might be happening here?
Thanks in advance!

The problem is in the following line:
cumulative_loss += nd.mean(loss).asscalar()
When you call asscalar(), MXNet has to implicitly make a synchronized call to copy the result from GPU to CPU: it is essentially the same as calling nd.waitall(). Since you do it on every iteration, it performs the sync every iteration, degrading your wall-clock time significantly.
What you can do is keep and update your cumulative_loss on the GPU and copy it to the CPU only when you actually need to display it - that can be every N iterations or after the epoch is done, depending on how long each iteration takes.
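A minimal sketch of that idea, reusing the names from the question (self.model, loss_fn, trainer_fn, train_iterator and batch_size are assumed to exist as above): the running loss stays on the GPU as an NDArray and is copied to the CPU only once per epoch.
for e in range(epochs):
    cumulative_loss = mx.nd.zeros((1,), ctx=mx.gpu(0))  # lives on the GPU
    for i, batch in enumerate(train_iterator):
        with autograd.record():
            output = self.model(batch.data[0])
            loss = loss_fn(output, batch.label[0])
        loss.backward()
        trainer_fn.step(batch_size)
        # Asynchronous GPU op: no implicit synchronization here.
        cumulative_loss += nd.mean(loss)
    train_iterator.reset()
    # The only synchronization point: one GPU -> CPU copy per epoch.
    print("epoch %d, loss %.4f" % (e, cumulative_loss.asscalar()))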

Related

Training GPU on multiple minibatches in parallel with TensorFlow

I am using TensorFlow 1.9, on an NVIDIA GPU with 3 GB of memory. The size of my minibatch is 100 MB. Therefore, I could potentially fit multiple minibatches on my GPU at the same time. So my question is about whether this is possible and whether it is standard practice.
For example, when I train my TensorFlow model, I run something like this on every epoch:
loss_sum = 0
for batch_num in range(num_batches):
    batch_inputs = get_batch_inputs()
    batch_labels = get_batch_labels()
    batch_loss, _ = sess.run([loss_op, train_op],
                             feed_dict={inputs: batch_inputs, labels: batch_labels})
    loss_sum += batch_loss
loss = loss_sum / num_batches
This iterates over my minibatches and performs one weight update per minibatch. But the size of batch_inputs and batch_labels is only 100 MB, so most of the GPU memory is not being used.
One option would be to just increase the minibatch size so that the minibatch is closer to the 3 GB GPU capacity. However, I want to keep the same small minibatch size to help with optimisation.
So the other option might be to send multiple minibatches through the GPU in parallel, and perform one weight update per minibatch. Being able to send the minibatches in parallel would significantly reduce the training time.
Is this possible and recommended?
The goal of the mini-batch approach is to update the weights of your network after each batch is processed and to use the updated weights in the next mini-batch. If you do some clever trick to batch several mini-batches together, they would effectively all use the same old weights.
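As a standalone illustration of that point (synthetic data and a plain linear model, not the poster's setup): if several mini-batches are processed at the same weights and their updates averaged, the result is exactly the gradient of one larger batch.
import numpy as np

rng = np.random.RandomState(0)
w = rng.randn(5)                              # shared "old" weights
X1, y1 = rng.randn(100, 5), rng.randn(100)    # mini-batch 1
X2, y2 = rng.randn(100, 5), rng.randn(100)    # mini-batch 2

def grad(X, y, w):
    # Gradient of the mean squared error of a linear model y_hat = X @ w.
    return 2.0 / len(y) * X.T @ (X @ w - y)

g_parallel = (grad(X1, y1, w) + grad(X2, y2, w)) / 2            # two batches, same weights
g_big = grad(np.vstack([X1, X2]), np.concatenate([y1, y2]), w)  # one double-size batch
print(np.allclose(g_parallel, g_big))  # True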
The only potential benefit I can see is if the model works better with bigger mini-batches, e.g. big_batches * more_epochs is better than mini_batches * less_epochs. I don't remember the theory behind mini-batch gradient descent, but I remember there is a reason you should use mini-batches instead of the whole training set for each iteration. On the other hand, the mini-batch size is a hyperparameter that has to be tuned anyway, so it's probably worth fiddling with it a bit.
I thought I might point out that arbitrarily making the batch size large (when you have large amounts of memory) can sometimes hurt the generalization of your model.
Reference:
Train longer, generalize better
On Large-Batch Training for Deep Learning.

tensorflow wide linear model inference on gpu slow

I am training a sparse logistic regression model in TensorFlow. This question is specifically about the inference part. I am trying to benchmark inference on CPU and GPU. I am using an NVIDIA P100 GPU (4 dies) on my current GCE box. I am new to GPUs, so sorry for the naive questions.
The model is pretty big, ~54k operations (is that considered big compared to DNN or ImageNet models?). When I log device placement, I only see gpu:0 being used, and the rest of them unused. I don't do any device placement during training, but during inference I want it to optimally place and use the GPUs.
A few things I observed: my input node placeholder (feed_dict) is placed on the CPU, so I assume my data is being copied from CPU to GPU? How does feed_dict exactly work behind the scenes?
1) How can I place the data on which I want to run prediction directly on the GPU? Note: my training runs on distributed CPU with multiple terabytes, so I cannot have a constant or variable directly in my graph during training, but for inference I can definitely have small batches of data that I would like to place directly on the GPU. Are there ways I can achieve this?
2) Since I am using a P100 GPU, I think it has unified memory with the host. Is it possible to have zero-copy and have my data loaded directly onto the GPU? How can I do this from Python, Java and C++ code? Currently I use feed_dict, which from various Google sources I gather is not at all optimal.
3) Is there some tool or profiler I can use to see, when I profile code like this:
for epoch_step in epochs:
    start_time = time.time()
    for i in range(epoch_step):
        result = session.run(output, feed_dict={input_example: records_batch})
    end_time = time.time()
    print("Batch {} epochs {} :time {}".format(batch_size, epoch_step, str(end_time - start_time)))
how much time is being spent on 1) CPU-to-GPU data transfer, 2) session run overhead, and 3) GPU utilization (currently I use nvidia-smi periodically to monitor it)?
4) What is the kernel call overhead on CPU vs GPU (I assume each invocation of sess.run invokes one kernel call, right)?
My current benchmarking results (all times in seconds):
Batch size : 10
NumberEpochs TimeGPU TimeCPU
10 5.473 0.484
20 11.673 0.963
40 22.716 1.922
100 56.998 4.822
200 113.483 9.773
Batch size : 100
NumberEpochs TimeGPU TimeCPU
10 5.904 0.507
20 11.708 1.004
40 23.046 1.952
100 58.493 4.989
200 118.272 9.912
Batch size : 1000
NumberEpochs TimeGPU TimeCPU
10 5.986 0.653
20 12.020 1.261
40 23.887 2.530
100 59.598 6.312
200 118.561 12.518
Batch size : 10k
NumberEpochs TimeGPU TimeCPU
10 7.542 0.969
20 14.764 1.923
40 29.308 3.838
100 72.588 9.822
200 146.156 19.542
Batch size : 100k
NumberEpochs TimeGPU TimeCPU
10 11.285 9.613
20 22.680 18.652
40 44.065 35.727
100 112.604 86.960
200 225.377 174.652
Batch size : 200k
NumberEpochs TimeGPU TimeCPU
10 19.306 21.587
20 38.918 41.346
40 78.730 81.456
100 191.367 202.523
200 387.704 419.223
Some notable observations:
As batch size increases, I see my GPU utilization increase (it reaches 100% for the only GPU it uses; is there a way I can tell TF to use the other GPUs too?).
Batch size 200k is the only case where my naive benchmarking shows a minor gain for the GPU compared to the CPU.
Increasing the batch size for a given number of epochs has minimal effect on time, on both CPU and GPU, up to batch size 10k: for 10 epochs at batch sizes 10, 100, 1k and 10k, the times stay fairly stable at ~5-7 sec on GPU and 0.48-0.96 sec on CPU (meaning that sess.run has much higher overhead than the graph computation itself?). Beyond that, going from 10k -> 100k -> 200k, the time increases much faster: for 10 epochs, going from 100k to 200k the GPU time increases from 11 to 19 sec and the CPU time also doubles. Why is that? It seems that for larger batch sizes, even though I issue just one sess.run, internally it splits the batch into smaller ones and calls sess.run twice, because 20 epochs at batch size 100k matches more closely with 10 epochs at batch size 200k.
How can I improve my inference further? I believe I am not using all the GPUs optimally.
Are there any ideas on how I can benchmark better, to get a better breakdown of the time for CPU -> GPU transfer and of the actual speedup of the graph computation from moving from CPU to GPU?
Is there a better way to load data, ideally zero-copy, directly onto the GPU?
Can I place some nodes on the GPU only during inference to get better performance?
Any ideas around quantization or optimizing the inference graph?
Any more ideas to improve GPU-based inference, maybe XLA-based optimization or TensorRT? I want high-performance inference code running these computations on the GPU while the application server crunches on the CPU.
One source of information is the TensorFlow docs on performance, including Optimizing for GPU and High Performance Models.
That said, those guides tend to target training more than batch inference, though certainly some of the principles still apply.
I will note that, unless you are using DistributionStrategy, TensorFlow will not automatically put ops on more than a single GPU (source).
In your particular case, I don't believe GPUs are yet well-tuned to do the type of sparse operations required for your model, so I don't actually expect it to do that well on a GPU (if you log the device placement, there's a chance the lookup is done on the CPU). A logistic regression model has only a (sparse) input layer and an output layer, so generally there are very few math ops. GPUs excel most when they are doing lots of matrix multiplies, convolutions, etc.
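As a minimal TF 1.x sketch (not from the original post) of how to check that: log_device_placement prints the op-to-device assignment the first time the graph runs, which makes CPU-placed sparse lookups visible. The output, input_example and records_batch names come from the question's benchmark loop; saver and checkpoint_path are assumed here.
import tensorflow as tf

# Print the op -> device assignment to stderr when the graph first runs.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    saver.restore(sess, checkpoint_path)  # assumed: restore the trained weights
    result = sess.run(output, feed_dict={input_example: records_batch})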
Finally, I would encourage you to use TensorRT to optimize your graph, though for your particular model there's no guarantee it does much better.

Training TensorFlow model with summary operations is much slower than without summary operations

I am training an Inception-like model using TensorFlow r1.0 on an NVIDIA Titan X GPU.
I added some summary operations to visualize the training procedure, using code as follows:
def variable_summaries(var):
    """Attach a lot of summaries to a Tensor (for TensorBoard visualization)."""
    with tf.name_scope('summaries'):
        mean = tf.reduce_mean(var)
        tf.summary.scalar('mean', mean)
        with tf.name_scope('stddev'):
            stddev = tf.sqrt(tf.reduce_mean(tf.square(var - mean)))
        tf.summary.scalar('stddev', stddev)
        tf.summary.scalar('max', tf.reduce_max(var))
        tf.summary.scalar('min', tf.reduce_min(var))
        tf.summary.histogram('histogram', var)
When I run these operations, the time cost of training one epoch is about 400 seconds. But when I turn off these operations, the time cost of training one epoch is just 90 seconds.
How can I optimize the graph to minimize the time cost of the summary operations?
Summaries of course slow down the training process, because you do more operations and you need to write them to disk. Also, histogram summaries slow training down even more, because histograms require more data to be copied from GPU to CPU than scalar values do.
So I would try to use histogram logging less often than the rest; that could make some difference.
The usual solution is to compute summaries only every X batches. Since you compute summaries only once per epoch and not every batch, it might be worth trying even less frequent summary logging.
It depends on how many batches you have in your dataset, but usually you don't lose much information by gathering logs a bit less often.
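As a minimal TF 1.x sketch of that "summaries only every X batches" idea (train_op, feed, sess and num_steps are assumed to exist; merged would come from tf.summary.merge_all() and writer from tf.summary.FileWriter):
SUMMARY_EVERY_N_STEPS = 100  # illustrative value

for step in range(num_steps):
    if step % SUMMARY_EVERY_N_STEPS == 0:
        # Occasionally run the (expensive) summary ops and write them to disk.
        _, summary = sess.run([train_op, merged], feed_dict=feed)
        writer.add_summary(summary, step)
    else:
        # Plain training step: no histogram copies from GPU to CPU, no disk write.
        sess.run(train_op, feed_dict=feed)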

Adam optimizer goes haywire after 200k batches, training loss grows

I've been seeing very strange behavior when training a network, where after a couple of hundred thousand iterations (8 to 10 hours) of learning fine, everything breaks and the training loss grows:
The training data itself is randomized and spread across many .tfrecord files containing 1000 examples each, then shuffled again in the input stage and batched to 200 examples.
The background
I am designing a network that performs four different regression tasks at the same time, e.g. determining the likelihood of an object appearing in the image and simultaneously determining its orientation. The network starts with a couple of convolutional layers, some with residual connections, and then branches into the four fully-connected segments.
Since the first regression results in a probability, I'm using cross entropy for its loss, whereas the others use the classical L2 distance. However, due to their nature, the probability loss is on the order of 0..1, while the orientation losses can be much larger, say 0..10. I already normalized both input and output values and use clipping
normalized = tf.clip_by_average_norm(inferred.sin_cos, clip_norm=2.)
in cases where things can get really bad.
I've been (successfully) using the Adam optimizer to optimize on the tensor containing all distinct losses (rather than reduce_suming them), like so:
reg_loss = tf.reduce_sum(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
loss = tf.pack([loss_probability, sin_cos_mse, magnitude_mse, pos_mse, reg_loss])
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate,
                                   epsilon=self.params.adam_epsilon)
op_minimize = optimizer.minimize(loss, global_step=global_step)
In order to display the results in TensorBoard, I then actually do
loss_sum = tf.reduce_sum(loss)
for a scalar summary.
Adam is set to learning rate 1e-4 and epsilon 1e-4 (I see the same behavior with the default value for epsilon, and it breaks even faster when I keep the learning rate at 1e-3). Regularization also has no influence on this one; it happens sort of consistently at some point.
I should also add that stopping the training and restarting from the last checkpoint - implying that the training input files are shuffled again as well - results in the same behavior. The training always seems to behave similarly at that point.
Yes. This is a known problem of Adam.
The equations for Adam are
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where m is an exponential moving average of the mean gradient and v is an exponential moving average of the squares of the gradients. The problem is that when you have been training for a long time and are close to the optimum, v can become very small. If all of a sudden the gradients start increasing again, they will be divided by a very small number and explode.
By default beta1=0.9 and beta2=0.999. So m changes much more quickly than v. So m can start being big again while v is still small and cannot catch up.
To remedy this problem you can increase epsilon, which is 1e-8 by default, thus avoiding the division by a value that is almost 0.
Depending on your network, a value of epsilon of 0.1, 0.01, or 0.001 might be good.
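In the question's own code that would mean constructing the optimizer with a larger epsilon, for example (the exact value is something you would need to tune):
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4,
                                   epsilon=1e-3)  # much larger than the 1e-8 default
op_minimize = optimizer.minimize(loss, global_step=global_step)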
Yes, this could be some sort of super complicated unstable numbers/equations case, but most certainly your learning rate is simply too high, as your loss quickly decreases until 25k iterations and then oscillates a lot at the same level. Try decreasing it by a factor of 0.1 and see what happens. You should be able to reach an even lower loss value.
Keep exploring! :)

Outputs of the distributed version of MNIST model

I am playing with the distributed version of MNIST on CloudML and I am not sure I understand the logs displayed during the training phase:
INFO:root:Train [master/0], step 1693: Loss: 1.176, Accuracy: 0.464 (760.724 sec) 4.2 global steps/s, 4.2 local steps/s
INFO:root:Train [master/0], step 1696: Loss: 1.175, Accuracy: 0.464 (761.420 sec) 4.3 global steps/s, 4.3 local steps/s
INFO:root:Eval, step 1696: Loss: 0.990, Accuracy: 0.537
INFO:root:Train [master/0], step 1701: Loss: 1.175, Accuracy: 0.465 (766.337 sec) 1.0 global steps/s, 1.0 local steps/s
I am batching over 200 examples at a time, randomly.
Why is there such a gap between Train acc/loss and Eval acc/loss, the metrics for the eval set being significantly higher than for the train set, when it is usually the opposite?
Also, what is the difference between the global step and the local step?
The code I am talking about is here. task.py is calling model.py, the file where the graph is created.
When you're doing distributed training you can have more than one worker. Each of these workers can compute an update to the parameters. Each time a worker computes an update, that counts as one local step. Depending on the type of training, synchronous vs. asynchronous, the updates can be combined in different ways before actually being applied to the parameters.
For example, every worker might update the parameters directly, or you might average the updates from each worker and apply the averaged update to the parameters only once.
The global step tells you how many times you actually updated the parameters. So if you have N workers and you apply each worker's update then N local steps should correspond to N global steps. On the other hand, if you have N workers and you take 1 update from each worker, average them, and then update the parameters, then you'd have 1 global step for every N local steps.
RE: gap between Train acc/loss and Eval acc/loss
This is a result of using an exponential moving average to smooth out the loss and accuracy between batches (otherwise the plot is very noisy). The way the metrics are calculated on the eval set and the train set is different:
* The eval set loss and accuracy are calculated using an exponential moving average over all batches of examples in the eval set; after each run of the evaluation the moving average is reset, so each point represents a single run. All eval steps are also calculated using a consistent checkpoint (a consistent value of the weights).
* The train set loss and accuracy are calculated using an exponential moving average over all batches of examples in the training set during the training process; the moving average is never reset, so it can carry over past information for a long time, and it is not based on a consistent checkpoint. This is a cheap-to-calculate approximation of the metrics.
We will be providing updated samples that calculate the loss on the training and eval sets in a consistent way. Surprisingly, even with the update, the eval set initially gets higher accuracy than the training set - probably the data wasn't properly shuffled and randomly split into training and eval sets, and the eval set contains a slightly 'easier' subset of the data. After more training steps the classifier starts to over-fit the training data and the accuracy on the training set exceeds the accuracy on the eval set.
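For reference, a minimal sketch of the kind of exponential-moving-average smoothing described above (the decay value and the batch_losses list are illustrative): each reported value mixes the newest batch metric into the running average, so old batches keep influencing the reported training curve.
decay = 0.99  # illustrative smoothing factor
smoothed = batch_losses[0]
reported = [smoothed]
for batch_loss in batch_losses[1:]:
    smoothed = decay * smoothed + (1 - decay) * batch_loss  # never reset during training
    reported.append(smoothed)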
RE: what is the difference between the global step and the local step?
Local steps is the number of batches processed by a single worker (each worker logs this information); global steps is the number of batches processed by all workers. When a single worker is used, the two numbers are equal. When using more workers in a distributed setting, global steps > local steps.
The training measures that are being reported are computed on just the "mini-batch" of examples used in that step (200 in your case). The eval measures are reported on the entire evaluation set.
So, the mini-batch training statistics will be quite noisy.
After further investigating the code in the example, the problem is that the accuracy reported over the training data is a moving average. Thus, the average reported at the current step is actually influenced by the average from many steps ago, which is typically lower. We will update the samples by not averaging with previous steps.