I am using TensorFlow. My GPU memory is not enough, so I want to average the gradients over 4 iterations and then update the variables.
How can I do this in TensorFlow?
I met the same problem. I think this example might be useful for your problem. It computes N batches with N GPUs and then does the backpropagation once. What you need to do is modify lines 165-166: run 'compute_gradients()' 'iter_size' times and then run 'average_gradients()' once.
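For a single GPU, here is a minimal sketch of the same gradient-accumulation idea (the toy model, 'iter_size' value, and variable names are illustrative, not taken from the linked example): accumulate the gradients of 'iter_size' mini-batches in non-trainable variables, then apply their average once.

import numpy as np
import tensorflow as tf

iter_size = 4  # number of mini-batches to average over

x = tf.placeholder(tf.float32, shape=[None, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

opt = tf.train.GradientDescentOptimizer(0.01)
grads_and_vars = opt.compute_gradients(loss)

# One accumulator variable per trainable variable.
accum = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
         for _, v in grads_and_vars]
zero_accum = [a.assign(tf.zeros_like(a)) for a in accum]
accum_ops = [a.assign_add(g) for a, (g, _) in zip(accum, grads_and_vars)]

# Apply the averaged gradients once every `iter_size` mini-batches.
apply_op = opt.apply_gradients(
    [(a / iter_size, v) for a, (_, v) in zip(accum, grads_and_vars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(zero_accum)
        for _ in range(iter_size):
            sess.run(accum_ops, feed_dict={x: np.random.rand(8, 10)})
        sess.run(apply_op)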
So I know how epochs, train steps, batch sizes and that kind of thing are defined, but it is really hard for me to get my head wrapped around the TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before it checks back in with the host. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
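As a toy illustration of why the two differ (the numbers below are made up, and in a real model each per-example gradient would also depend on the current weights, which makes the gap even larger):

import numpy as np

# Hypothetical per-example gradients for a batch of 4 examples.
grads = np.array([0.2, -0.1, 0.4, 0.3])
lr = 0.1
w = 1.0

# One step with batch size 4: apply the mean of the 4 gradients once.
w_one_big_step = w - lr * grads.mean()

# Four steps with batch size 1: apply each gradient in turn.
w_four_small_steps = w
for g in grads:
    w_four_small_steps -= lr * g

print(w_one_big_step, w_four_small_steps)  # 0.98 vs 0.92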
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
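A minimal sketch of that idea (this is not the actual TPUEstimator internals, just an illustration of wrapping several steps in one tf.while_loop so that a single session.run drives all of them):

import tensorflow as tf

iterations_per_loop = 100

def body(i, w):
    # Placeholder "training step": a real model would run an optimizer
    # update here; we just nudge the value to keep the sketch runnable.
    return i + 1, w * 0.99

_, final_w = tf.while_loop(
    cond=lambda i, w: i < iterations_per_loop,
    body=body,
    loop_vars=(tf.constant(0), tf.constant(1.0)))

with tf.Session() as sess:
    # One session.run call executes all `iterations_per_loop` iterations
    # on the device before control returns to the host.
    print(sess.run(final_w))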
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.
I am training a model on several GPUs on a single machine using TensorFlow. However, I find the speed is much slower than training on a single GPU. I am wondering whether TensorFlow executes the sub-models on different GPUs in parallel or sequentially. For example:
x = 5
y = 2
with tf.device('/gpu:0'):
    z1 = tf.multiply(x, y)
with tf.device('/gpu:1'):
    z2 = tf.add(x, y)
Does the code inside /gpu:0 and /gpu:1 execute sequentially? If so, how can I make the two parts execute in parallel? Assume the two parts do not depend on each other.
In TensorFlow, ops only execute when they are fetched (or needed by a fetched op) in a session.run call, so if you only fetch z2, only the second block (inside gpu:1) would execute, since nothing in it depends on the first block.
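If you fetch both tensors in a single session.run call, the runtime is free to schedule the two independent device blocks concurrently. A minimal sketch (allow_soft_placement just falls back gracefully if a second GPU is not available):

import tensorflow as tf

x = tf.constant(5.0)
y = tf.constant(2.0)

with tf.device('/gpu:0'):
    z1 = tf.multiply(x, y)
with tf.device('/gpu:1'):
    z2 = tf.add(x, y)

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    # Both ops are requested in one call, so the scheduler can run the two
    # independent device blocks in parallel.
    print(sess.run([z1, z2]))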
Yes, it executes sequentially; by its nature, the with block will wait until its computation is complete before moving to the next code block.
You can use TensorFlow's queues and threading to leverage your additional compute (a sketch is shown below).
Please refer to this tutorial from TensorFlow:
https://www.tensorflow.org/api_guides/python/threading_and_queues
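A minimal sketch of the queue-plus-threads pattern from that guide (TF 1.x API; the enqueue op here just produces random numbers for illustration):

import tensorflow as tf

queue = tf.FIFOQueue(capacity=10, dtypes=tf.float32)
enqueue_op = queue.enqueue(tf.random_normal([]))
dequeue_op = queue.dequeue()

# Two background threads keep the queue filled while the main thread consumes it.
qr = tf.train.QueueRunner(queue, [enqueue_op] * 2)
tf.train.add_queue_runner(qr)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    for _ in range(5):
        print(sess.run(dequeue_op))
    coord.request_stop()
    coord.join(threads)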
I tried to setup a very simple Mnist example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with tensorflow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (shown in nvidia-smi).
I also tried to increase the queue capacity of the slice_input_producer and to increase the number of threads used for batching. However, this only got me to about 45% GPU utilization, and it resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way to feed the inputs and batch them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
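For reference, a simplified sketch of the kind of pipeline I mean (TF 1.x; the placeholder data, capacity, and num_threads values are illustrative, not copied from the gist):

import numpy as np
import tensorflow as tf

images = np.random.rand(1000, 784).astype(np.float32)   # placeholder data
labels = np.random.randint(0, 10, size=1000).astype(np.int32)

# Produces one (image, label) slice at a time; `capacity` is the queue size.
image, label = tf.train.slice_input_producer(
    [tf.constant(images), tf.constant(labels)], shuffle=True, capacity=256)

# `num_threads` controls how many threads fill the batching queue.
image_batch, label_batch = tf.train.batch(
    [image, label], batch_size=128, num_threads=4, capacity=512)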
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM
What is the use of the function tf.train.get_global_step() in TensorFlow?
In machine learning concepts what is it equivalent to?
You could use it to restart training exactly where you left off when the training procedure has been stopped for some reason. Of course you can always restart training without knowing the global_step (if you save checkpoints regularly in your code, that is), but unless you somehow keep track of how many iterations you have already performed, you will not know how many iterations are left after the restart. Sometimes you really want your model to be trained for exactly n iterations, not n plus an unknown number completed before the crash. So in my opinion, this is more of a practicality than a theoretical machine learning concept.
tf.train.get_global_step() returns the global step (a variable, a tensor from the variable node, or None) via get_collection(tf.GraphKeys.GLOBAL_STEP) or get_tensor_by_name('global_step:0').
The global step is widely used in learning rate decay (e.g. tf.train.exponential_decay; see Decaying the learning rate for more information).
You can pass the global step to the optimizer's apply_gradients or minimize method so that it is incremented by one on each training step.
Once you have defined the global step variable, you can get its value with sess.run(global_step_op).
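A minimal sketch tying these pieces together (TF 1.x; the toy loss and decay values are illustrative):

import tensorflow as tf

# Create (or fetch) the global step variable.
global_step = tf.train.get_or_create_global_step()

# Use it to decay the learning rate over time.
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1, global_step=global_step,
    decay_steps=1000, decay_rate=0.96, staircase=True)

# A toy loss; replace with your model's loss.
w = tf.Variable(5.0)
loss = tf.square(w)

# Passing global_step to minimize() increments it by one per training step.
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(train_op)
    # tf.train.get_global_step() retrieves the same variable from the graph.
    print(sess.run(tf.train.get_global_step()))  # -> 3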
Hi friends!
I have a question about processing with multiple GPUs.
I'm using 4 GPUs and tried a simple A^n + B^n example in 3 ways, as shown below.
Single GPU
with tf.device('/gpu:0'):
    ...tf.matpow code...

Multiple GPUs
with tf.device('/gpu:0'):
    ...tf.matpow code...
with tf.device('/gpu:1'):
    ...tf.matpow code...

No specific GPU designated (I think maybe all GPUs are used)
...just tf.matpow code...
When I tried this, I could not make sense of the results.
The results were:
1. single GPU: 6.x seconds
2. multiple GPUs (2 GPUs): 2.x seconds
3. no specific GPU designated (maybe 4 GPUs): 4.x seconds
I cannot understand why #2 is faster than #3.
Can anyone help me?
Thanks.
While the TensorFlow scheduler works well for single GPUs, it is not yet as good at optimizing the placement of computations on multiple GPUs. (Although it is being worked on presently.) Without further details, it's hard to know exactly what's going on. To get a better picture, you can log where the computations are actually being placed by the scheduler. You can do this by setting the log_device_placement flag when creating the tf.Session:
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
In the third code sample (where no GPU was designated), TensorFlow didn't use all of your GPUs. By default, if TensorFlow can find a GPU ("/gpu:0") to use, it assigns as many calculations as possible to that GPU. You would need to tell it specifically that you want it to use all 4, like you did in the second code sample.
From the Tensorflow documentation:
If you have more than one GPU in your system, the GPU with the lowest ID will be selected by default. If you would like to run on a different GPU, you will need to specify the preference explicitly:
with tf.device('/gpu:2'):
    # tf code here