GPU OOM after many hours of training - tensorflow

I am training a model within my own loop kind of like this:
while True:
[x,y] = getSomeTrainingData(...),y,...)
What I'm seeing is that it trains for a very long time, but then randomly hits an OOM error on the GPU. The batch size and data size are constant. What would cause this, and is there anything I can do, such as running some kind of garbage collection between iterations?


What is the purpose of the buffer_size in tensorflow dataset shuffle() method?

I understand what it does, but I have no idea why the whole set isn't considered by default.
I created an input pipeline for training a NN earlier and shuffled the data, which is about 10110 instances ordered by class (there are 7 classes), but I set the buffer_size to 100 without thinking much of it. I trained the model and found it to behave very strangely. It would start with immensely high accuracy for the first couple of batches and then drop drastically thereafter. After about 1 or 2 epochs, the validation accuracy went to 0.
I suspected that the too-small buffer size in the shuffle() method had something to do with it: because it only picks randomly from 100 elements at a time, the dataset stays in nearly the same order, and with a batch size of 64 the model would update on homogeneous batches each time. And lo and behold, my suspicions were correct. The model performed much better after I set the buffer size to an arbitrarily large value (4000). But I'm still left scratching my head as to why a buffer size is needed at all, other than, I suppose, to save memory?
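To make the effect described above concrete, here is a minimal pure-Python sketch of a buffer-based shuffle. It only simulates the behavior and is not TensorFlow's actual implementation; buffered_shuffle, the seed, and the label layout are hypothetical names and values made up for the illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in an order produced by a fixed-size shuffle buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit one random slot and refill it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


NUM_EXAMPLES, NUM_CLASSES, BATCH_SIZE = 10110, 7, 64
# Labels sorted by class, as in the question: 0, 0, ..., 1, 1, ..., 6, 6.
labels = [i * NUM_CLASSES // NUM_EXAMPLES for i in range(NUM_EXAMPLES)]


def classes_in_first_batch(buffer_size):
    shuffled = list(buffered_shuffle(labels, buffer_size))
    return sorted(set(shuffled[:BATCH_SIZE]))


small_buffer_classes = classes_in_first_batch(100)
full_buffer_classes = classes_in_first_batch(NUM_EXAMPLES)
print("buffer_size=100:  ", small_buffer_classes)
print("buffer_size=10110:", full_buffer_classes)
```

In this sketch the k-th emitted element can only come from the first buffer_size + k inputs, so with a buffer of 100 the first batch of 64 is drawn entirely from the first 164 examples, all of which belong to the first class. Only a buffer at least as large as the dataset gives a uniform shuffle, and the price is holding that many elements in memory, which is presumably why the size is left to the caller.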

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard for me to get my head wrapped around the TPU terminology like train loops, iterations per loop and so on. I read this but I'm still confused.
Also, how can I benchmark the time for iterations per loop, for example?
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
    # The TPU will run `iterations_per_loop` training iterations before returning to the host
    for i in range(0, iterations_per_loop):
        # on every iteration, the TPU will compute `train_batch_size` examples,
        # calculating the gradient from every example in the given batch
        compute(examples[i * train_batch_size : (i + 1) * train_batch_size])

# assume each entry in `examples` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop):
    process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
                   train_batch_size,
                   iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient at each iteration is computed as the average of the gradients of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking 1k steps per 50k examples, since the latter case calculates the gradient much more often.
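The difference between the two parameters can also be checked on a toy problem. The sketch below is plain Python, not TPU code: train_one_epoch is a hypothetical helper that fits a single weight with SGD, and the inner for loop stands in for the work done on the device between two returns to the host.

```python
def train_one_epoch(data, train_batch_size, iterations_per_loop, learning_rate=0.1):
    """Fit one weight w to minimize mean((w - x)^2) with plain SGD."""
    w = 0.0
    host_calls = 0
    total_steps = len(data) // train_batch_size
    step = 0
    while step < total_steps:
        # One pass through this outer loop models one host -> device call.
        host_calls += 1
        steps_this_loop = min(iterations_per_loop, total_steps - step)
        for _ in range(steps_this_loop):
            batch = data[step * train_batch_size:(step + 1) * train_batch_size]
            # Gradient averaged over every example in the batch.
            grad = sum(2.0 * (w - x) for x in batch) / len(batch)
            w -= learning_rate * grad
            step += 1
    return w, host_calls


data = [float(i % 10) for i in range(1000)]

w_a, calls_a = train_one_epoch(data, train_batch_size=50, iterations_per_loop=1)
w_b, calls_b = train_one_epoch(data, train_batch_size=50, iterations_per_loop=10)
w_c, calls_c = train_one_epoch(data, train_batch_size=10, iterations_per_loop=10)

print("same batch size, different iterations_per_loop:", w_a == w_b, calls_a, calls_b)
print("different batch size:", w_b, w_c)
```

Changing iterations_per_loop only changes how many times the host is involved (20 calls versus 2 here) and leaves the trained weight identical, while changing train_batch_size changes the number of gradient updates and therefore the result.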
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one call.
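As for benchmarking, one rough approach is to time each host-side call and divide by the number of iterations it covered. The sketch below is a pure-Python stand-in under that assumption: run_training and fake_device_loop are hypothetical names, and fake_device_loop merely burns CPU time where a real setup would perform one Session run on the TPU.

```python
import time


def run_training(train_steps, iterations_per_loop, run_loop):
    """Call run_loop once per loop and record (iterations, seconds) for each call."""
    timings = []
    steps_done = 0
    while steps_done < train_steps:
        steps = min(iterations_per_loop, train_steps - steps_done)
        start = time.perf_counter()
        run_loop(steps)  # hypothetical: the device runs `steps` iterations here
        elapsed = time.perf_counter() - start
        timings.append((steps, elapsed))
        steps_done += steps
    return timings


def fake_device_loop(steps):
    # Placeholder workload standing in for the real training iterations.
    total = 0
    for i in range(steps * 1000):
        total += i * i
    return total


timings = run_training(train_steps=1000, iterations_per_loop=100,
                       run_loop=fake_device_loop)
for steps, elapsed in timings[:3]:
    print(f"{steps} iterations in {elapsed:.4f} s "
          f"({elapsed / steps * 1e3:.3f} ms per iteration)")
```

With train_steps=1000 and iterations_per_loop=100 the host is involved 10 times. A larger iterations_per_loop means fewer, longer calls, so per-iteration timings become coarser but the synchronization overhead is spread over more work.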
TPU literally means "Tensor Processing Unit"; it's a hardware device used for computation in much the same way a GPU is used. TPUs are effectively Google's proprietary counterpart to GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

OOM after n iterations in tensorflow without further tensor allocation

Several times, when using as much GPU memory as possible, I've experienced OOM errors only after a certain number of training iterations have passed (without allocating new tensors explicitly). Reducing the batch size just a bit (e.g. from 32 to 30) has always solved the problem, but I can't understand what could be causing this behavior.

Tensorflow GPU Memory exhausted during mean squared error

I have a Tensorflow model which is a recurrent neural network using long short term memory. The state size is 3000, each time step of input has 300 inputs, there are about 500 time steps, and 1 output for each time step. I am training a sequence-to-sequence model.
It runs fine for inputs with less than 500 time steps, but somewhere around 500 timesteps, it crashes with the following out of memory error:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20375,20375]
[[Node: gradients/mean_squared_error/Mul_grad/mul_1 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](mean_squared_error/Square, gradients/mean_squared_error/Sum_grad/Tile)]]
[[Node: gradients/MatMul_grad/tuple/control_dependency_1/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5086_gradients/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
And this is running on a GPU with 12 GB of memory.
I have tried running it on my laptop cpu, and it seems to use very little memory (about 1 to 2 gb), but it's so slow that it never did get to 500 time steps. I'm working on some changes that will make it skip to 500 time steps to see how much memory it uses when not running on a GPU.
My question is: Where could Tensorflow possibly want to allocate a tensor of shape [20375, 20375]? It seems to be related to the tf.losses.mean_squared_error function, but that doesn't seem like an operation that should require such exorbitant amounts of memory.
I have tried reducing the batch size, but that just pushes the failure point up to a few more time steps, and I'll need up to a few thousand time steps, so this doesn't seem like a good long-term solution. I'd prefer to get to the root of the problem.
Here is the relevant code for the mean squared error:
initial_state_tuple = tf.contrib.rnn.LSTMStateTuple(initial_state, initial_hidden_state)
# Create the actual RNN
with tf.variable_scope(VARIABLE_SCOPE, reuse=None):
    cell = tf.contrib.rnn.BasicLSTMCell(STATE_SIZE)
    # NOTE: the original line is cut off after `inputs=networkinput,`;
    # the initial_state argument below is an assumption.
    rnn_outputs, finalstate = tf.nn.dynamic_rnn(cell=cell, inputs=networkinput,
                                                initial_state=initial_state_tuple)
with tf.variable_scope(VARIABLE_SCOPE, reuse=True):
    weights = tf.get_variable(name=WEIGHTS_NAME, shape=[STATE_SIZE, 1], dtype=tf.float32)
    biases = tf.get_variable(name=BIASES_NAME, shape=[1], dtype=tf.float32)
# Build the output layers
rnn_outputs_reshaped = tf.reshape(rnn_outputs, [-1, STATE_SIZE])
network_outputs = tf.sigmoid(tf.matmul(rnn_outputs_reshaped, weights) + biases)
expected_outputs_reshaped = tf.reshape(expected_outputs, [-1, 1])
# Loss mask just cancels out the inputs that are padding characters, since not all inputs have the same number of time steps
loss_mask_reshaped = tf.reshape(loss_mask, shape=[-1])
expected_outputs_reshaped = loss_mask_reshaped * expected_outputs_reshaped
network_outputs = loss_mask_reshaped * network_outputs
loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
If you want all of the code, it can be found here. The relevant functions are buildtower() and buildgraph(). The constants NUM_GPUS and BATCH_SIZE are set to appropriate values when running on the machine with the GPUs.
Update: I replaced the line
    loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
with
    error_squared = tf.pow(expected_outputs_reshaped - network_outputs, 2)
    loss = tf.reduce_mean(error_squared)
and the same error happened. I reduced the state size to 30 and the batch size to 5, and the error still happened, although it did make it up to about 3000 time steps.
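One thing that may be worth checking in the snippet above is the shape of the mask multiplication. The mask is reshaped to [-1] while the expected outputs are reshaped to [-1, 1], and TensorFlow's element-wise multiply follows NumPy-style broadcasting, so the product of those two shapes is a square matrix. The NumPy sketch below only illustrates that shape rule with a small n; I have not run the original model, so treat this as a possible explanation and not a confirmed diagnosis.

```python
import numpy as np

n = 5  # stand-in for batch_size * time_steps (20375 in the traceback)

loss_mask = np.ones(n, dtype=np.float32)       # shape (n,), like tf.reshape(loss_mask, [-1])
expected = np.ones((n, 1), dtype=np.float32)   # shape (n, 1), like tf.reshape(expected_outputs, [-1, 1])

# (n,) * (n, 1) broadcasts to (n, n) instead of staying (n, 1).
masked = loss_mask * expected
print(masked.shape)

# Giving the mask the same rank keeps the result at (n, 1).
masked_same_rank = loss_mask.reshape(-1, 1) * expected
print(masked_same_rank.shape)

# Size of one float32 tensor of shape [20375, 20375].
full_n = 20375
bytes_needed = full_n * full_n * 4
print(f"{bytes_needed / 1e9:.2f} GB per tensor")
```

If this is what happens in the graph, every tensor derived from the masked values has n * n entries, where n is the batch size times the number of time steps. That would be consistent with the failure point moving with both the batch size and the sequence length, and with a hand-written loss failing the same way.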
Update: After doing some research, I have found that, when training an RNN with a large number of time steps, truncated backpropagation is often used. This leads me to believe that backpropagation through a large number of time steps inherently takes a lot of memory, and my issue is not that I've constructed my graph wrong, but that I have a fundamental misunderstanding of the resource requirements of gradient calculations. To this end, I am working on changing my code to use truncated backpropagation. I will report back with results.
This project is my first experience with machine learning and Tensorflow, and after doing some research, it seems I had some fundamental misunderstandings.
I had thought that memory usage would scale linearly with the number of time steps in my data. Because every other dimension of my model (batch size, state size) was small, I expected that I could get up to quite a few time steps before running out of memory. However, in my runs the memory used for computing the gradients grew much faster than linearly with the number of time steps, so no matter how small I made the state size and batch size, it would eventually exhaust all my memory because of the large number of time steps.
To deal with this, I am using truncated backpropagation, in which each batch is broken up into chunks of some fixed number of time steps. This is not perfect, because it means that errors can only be propagated back at most this many time steps. However, based on what I've found online, it seems to work well enough, and there are not many other ways to get around the memory usage issue.
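The chunking step can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code from my project: split_into_chunks and the integer "state" are hypothetical stand-ins, and in a real model the state carried between chunks would be the LSTM state, treated as a constant when computing gradients for the next chunk.

```python
def split_into_chunks(sequence, chunk_len):
    """Break one long sequence into consecutive chunks of at most chunk_len steps."""
    return [sequence[i:i + chunk_len] for i in range(0, len(sequence), chunk_len)]


sequence = list(range(500))          # 500 time steps of dummy input
chunks = split_into_chunks(sequence, chunk_len=50)

state = 0                            # stand-in for the recurrent state
for chunk in chunks:
    # In a real model: run the RNN over `chunk` starting from `state`,
    # backpropagate within this chunk only, then carry the final state forward.
    state = state + sum(chunk)

longest_chunk = max(len(chunk) for chunk in chunks)
print(len(chunks), "chunks, at most", longest_chunk, "time steps held per update")
```

The gradient computation then only ever spans chunk_len time steps at once, at the cost that errors cannot flow back across a chunk boundary.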
As I said before, this is all my first experience with machine learning, so if anything in here is blatantly wrong, please tell me.

GPU + CPU Tensorflow Training

I have a network, one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (max 15000 vectors) which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store this embed_mat on the GPU due to its size, I define it under a CPU device (say /cpu:0), but the rest of the parameters of the model, the optimizer etc. are defined under a GPU device (say /gpu:0).
I see that my GPU usage is very minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup is copied to the GPU and all my training happens there. Am I doing something wrong?
The training time is very largely affected by the size (num_vectors) of the embedding matrix which doesn't seem correct to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if at all, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam Optimizer for this. Looks like because the embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
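For a sense of scale, here is a back-of-the-envelope calculation in plain Python. It assumes float32 parameters, which the question does not state, and relies only on the fact that Adam keeps two moment accumulators with the same shape as each variable.

```python
NUM_VECTORS = 3_000_000        # rows in embed_mat
EMBED_DIM = 300                # columns in embed_mat
BYTES_PER_FLOAT32 = 4          # assumption: float32 parameters
LOOKED_UP_PER_BATCH = 15_000   # max vectors touched per mini-batch

embed_bytes = NUM_VECTORS * EMBED_DIM * BYTES_PER_FLOAT32
adam_slot_bytes = 2 * embed_bytes          # first and second moment estimates
total_bytes = embed_bytes + adam_slot_bytes
touched_fraction = LOOKED_UP_PER_BATCH / NUM_VECTORS

print(f"embed_mat:           {embed_bytes / 1e9:.1f} GB")
print(f"Adam accumulators:   {adam_slot_bytes / 1e9:.1f} GB")
print(f"total optimizer set: {total_bytes / 1e9:.1f} GB")
print(f"rows touched per batch: {touched_fraction:.1%}")
```

So the state tied to embed_mat is about three times the size of the matrix itself, while each mini-batch touches only about half a percent of its rows. If the optimizer updates its accumulators densely on every step, the per-step cost would grow with num_vectors no matter how few rows are looked up; whether that is what happens here is a guess, and a variant that only updates the looked-up rows, such as tf.contrib.opt.LazyAdamOptimizer, might be worth trying.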
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.