RNN model running out of memory in TensorFlow - tensorflow

I implemented a Sequence to Sequence model using the rnn.rnn helper in TensorFlow.
with tf.variable_scope("rnn") as scope, tf.device("/gpu:0"):
    cell = tf.nn.rnn_cell.BasicLSTMCell(4096)
    lstm = tf.nn.rnn_cell.MultiRNNCell([cell] * 2)

    _, cell = rnn.rnn(lstm, input_vectors, dtype=tf.float32)
    tf.get_variable_scope().reuse_variables()
    lstm_outputs, _ = rnn.rnn(lstm, output_vectors, initial_state=cell)
The model is running out of memory on a Titan X with 16 GB of memory while allocating gradients for the LSTM cells:
W tensorflow/core/kernels/matmul_op.cc:158] Resource exhausted: OOM when allocating tensor with shape[8192,16384]
W tensorflow/core/common_runtime/executor.cc:1102] 0x2b42f00 Compute status: Resource exhausted: OOM when allocating tensor with shape[8192,16384]
[[Node: gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/concat, gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/add_grad/tuple/control_dependency)]]
If I reduce the length of the input and output sequences to 4 or less, the model runs without a problem.
This indicates to me that TF is trying to allocate the gradients for all time steps at the same time. Is there a way of avoiding this?

The function tf.gradients, as well as the minimize method of the optimizers, allows you to set a parameter called aggregation_method. The default value is ADD_N. This method constructs the graph in such a way that all gradients need to be computed at the same time.
There are two other undocumented methods, tf.AggregationMethod.EXPERIMENTAL_TREE and tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N, which do not have this requirement.
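For example, a minimal sketch of passing one of these methods to either call (TF 1.x; loss here stands for your model's scalar loss tensor):

import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)

# Either hand the aggregation method to minimize ...
train_op = optimizer.minimize(
    loss, aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)

# ... or pass it to tf.gradients directly.
grads = tf.gradients(
    loss, tf.trainable_variables(),
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_TREE)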

Related

How to set specific gpu in bert?

ResourceExhaustedError (see above for traceback):
OOM when allocating tensor of shape [768] and type float [[node
bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Initializer/zeros
(defined at /home/zyl/souhu/bert/optimization.py:122) =
Const_class=["loc:#bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Assign"],
dtype=DT_FLOAT, value=Tensor, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
How do I set GPU 1, or another GPU, to run BERT?
The easiest way to choose which GPUs are used is to set the CUDA_VISIBLE_DEVICES environment variable. The selected card will still show up as GPU:0 inside TensorFlow, but it will be a physically different device.
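A minimal sketch, assuming the variable is set before TensorFlow touches the GPUs (i.e. before the session or estimator is created):

import os

# Expose only the second physical GPU; inside the process it shows up as GPU:0.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import tensorflow as tf  # import / create the session only after setting the variable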
If you are using BERT within Python (which is rather a painful way), you can wrap the code that creates the BERT graph in a device block:
with tf.device('/device:GPU:1'):
    model = modeling.BertModel(...)

OOM with tensorflow

I'm facing an OOM error while training my tensorflow model; its structure is as follows (a rough sketch of the stack appears after the list):
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
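Roughly, the model looks like this (a sketch with assumed values for embed_dim, num_units, keep_prob and the number of output units; embedding_matrix stands for the GoogleNewsVector weights):

import tensorflow as tf

def build_model(token_ids, embedding_matrix, num_units=256, keep_prob=0.8):
    # token_ids: [batch, time] int32 ids; embeddings initialized from the pretrained matrix
    embedded = tf.contrib.layers.embed_sequence(
        token_ids, vocab_size=8938, embed_dim=300,
        initializer=tf.constant_initializer(embedding_matrix))

    def stack():
        return tf.nn.rnn_cell.MultiRNNCell(
            [tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell(num_units),
                                           output_keep_prob=keep_prob)
             for _ in range(2)])

    # bidirectional RNN over the embedded sequence, then a dense output layer
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(
        stack(), stack(), embedded, dtype=tf.float32)
    return tf.layers.dense(tf.concat(outputs, axis=-1), units=2)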
I tried reducing the batch size to as low as 64; my input data is padded to 1500 and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en). I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the estimator API with MirroredStrategy and still no luck. Is there a way to ask tensorflow to run the training on the GPUs but keep the tensors stored in the main machine's memory? Any other suggestions are welcome.
Running a particular operation (e.g. some tensor multiplication during training) on a GPU requires that the tensors involved be stored on that GPU.
You might want to use TensorBoard or a similar tool to see which operations in your graph require the most memory. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you'd need to narrow it down somehow.
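If it helps, here is a minimal sketch (TF 1.x, plain Session API rather than the Estimator) of recording per-op memory and timing data that TensorBoard can display; train_op, loss and writer are assumed to exist in your training script:

import tensorflow as tf

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Trace one step; the recorded metadata shows per-op memory use in TensorBoard.
    sess.run([train_op, loss], options=run_options, run_metadata=run_metadata)
    writer.add_run_metadata(run_metadata, tag="memory_trace_step_0")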

Calculating Perplexity and Memory Issues in Keras/Tensorflow

I'd like to evaluate my model with perplexity after each training epoch. I'm using Keras with the Tensorflow backend. The problem is that after each evaluation more and more memory is used but never released, so after a few epochs my system crashes. It would work without the memory issue if I weren't using keras and tensorflow functions, but then it would be way too slow.
Here is the code:
def compute_perplexity(self, modelName, sentences):
    all_labels, all_predictions = self.predictLabels_for_perplexity_evaluation(self.models[modelName], sentences)
    # add an axis to fit tensor shape
    for i in range(len(all_labels)):
        all_labels[i] = all_labels[i][:, :, np.newaxis]

    # calculate perplexity for each sentence length and each datapoint and append to list
    perplexity = []
    for i in range(10, 15):  # range(len(all_labels)):
        start = time.time()
        xentropy = K.sparse_categorical_crossentropy(
            tf.convert_to_tensor(all_labels[i]), tf.convert_to_tensor(all_predictions[i]))
        perplexity.append(K.eval(K.pow(2.0, xentropy)))
        print('time for one set of sentences. ', time.time() - start)

    # average for each datapoint
    for i in range(len(perplexity)):
        perplexity[i] = np.average(perplexity[i], axis=1)
        perplexity[i] = np.average(perplexity[i])

    return np.mean(perplexity)
There is no need to evaluate this metric using TensorFlow. What your code does is add the all_labels array to the graph each time the function is called, which explains the memory usage you are seeing.
Consider implementing all of this computation in numpy, or building the operation once and evaluating it with new data in a session via feed_dict (without using tf.convert_to_tensor).
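For example, a minimal numpy sketch of the same computation, assuming labels has shape (batch, time) with integer class ids and predictions has shape (batch, time, vocab) with per-class probabilities:

import numpy as np

def perplexity_numpy(labels, predictions, eps=1e-12):
    # probability the model assigned to the true class at each time step
    probs = np.take_along_axis(predictions, labels[..., np.newaxis], axis=-1).squeeze(-1)
    xentropy = -np.log2(probs + eps)             # per-token cross-entropy in bits
    per_sentence = 2.0 ** xentropy.mean(axis=1)  # perplexity per sentence
    return per_sentence.mean()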

No shape error in tensorflow graph construction but getting shape mismatch error during graph computation

There occurs no error in tensorflow graph construction, but I get a shape mismatch error during graph computation in tf.gradients (I guess that the error is in back propagation).
This is the error I get:
InvalidArgumentError (see above for traceback):
Input to reshape is a tensor with 16777216 values, but the requested shape has 4096
[[Node: gradients/truediv_grad/Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0
/device:GPU:0"](gradients/truediv_grad/Sum, gradients/truediv_grad/Shape)]]
I solved the issue using two techniques:
1. Apparently, if you are creating custom ops and gradients, you need to be very explicit in providing shape information to tensorflow, using set_shape or tf.reshape.
2. When registering your gradient with tf.RegisterGradient, which takes op and grad as inputs, you need to be careful when chaining the gradients, i.e. dy/dx = dy/dz * dz/dx.
Say dy/dz is the custom gradient we have created and dz/dx is the gradient of the previous ops as per the chain rule of differentiation.
@tf.RegisterGradient("MyGrad")
def Mygrad(op, grad):
    # ... use op.inputs to compute the custom gradient, say cust_grad (dy/dz) ...
    return cust_grad * grad
I changed this to the following:
@tf.RegisterGradient("MyGrad")
def Mygrad(op, grad):
    # ... use op.inputs to compute the custom gradient, say cust_grad (dy/dz) ...
    return tf.matmul(tf.reshape(cust_grad, [calculated_shape]),
                     tf.reshape(grad, expected_shape))
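For context, here is a sketch of how such a registered gradient is usually attached to an op via gradient_override_map (TF 1.x); overriding Identity is purely illustrative, and Mygrad above would need to return a concrete tensor for this to run:

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    # Identity ops created inside this scope use the "MyGrad" function during backprop.
    with g.gradient_override_map({"Identity": "MyGrad"}):
        y = tf.identity(x)
    dy_dx = tf.gradients(y, x)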

In tensorflow, how to calculate sequence loss using output from dynamic_decode

Hi fellow tensorflowers,
I am trying to implement a sequence to sequence model using the new seq2seq module that is under development and released with TF 1.0 and 1.1.
There is a dynamic_decode function that returns logits in the form of rnn_output.
Then I need to calculate the loss using the output of the rnn.
When I run it naively, just by calling tf.contrib.seq2seq.loss.sequence_loss with (rnn_output, weights, logits), it crashes with:
InvalidArgumentError (see above for traceback): Incompatible shapes: [1856,1,1024] vs. [9600,1,1024]
[[Node: optimize/gradients/loss/sequence_loss/sampled_softmax_loss/Mul_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](optimize/gradients/loss/sequence_loss/sampled_softmax_loss/Mul_grad/Shape/_3099, optimize/gradients/loss/sequence_loss/sampled_softmax_loss/Mul_grad/Shape_1/_3101)]]
[[Node: optimize/gradients/Add/_824 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:3", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_2787_optimize/gradients/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:3"](^_cloopMainDynamicDecoderWithAttention/decoder/decoder/while/BasicDecoderStep/multi_rnn_cell/cell_1/multi_rnn_cell/cell_2/lstm_cell/zeros/_128)]]
Which is natural, since rnn_output is dynamically shaped.
I have two possible solutions:
1. "pack" dynamic tensor into a tensor of size equal to maximum allowed length. I don't know how to pack a dynamic tensor into a tensor of fixed size, but it probably has to do smth with new interfaces for dynamic shape: tf.while_loop and TensorArrays. It would be great to hear some advice on that
2. Dynamically calculate sequence_loss. But my knowledge of inner tensorflow implementation is too limited to assess correctly whether it's something easy to do. Any suggestions here?
The general question
What is a right approach to calculate sampled/normal softmax cross-entropy loss from dynamicaly shaped rnn_output of dynamic_decode?
I have the following code:
decoder_outputs, decoder_state = seq2seq.dynamic_decode(
    my_decoder, output_time_major=False, parallel_iterations=512, swap_memory=True)
self.logits = decoder_outputs.rnn_output
self.loss = loss.sequence_loss(
    self.logits,
    tf.transpose(tf.stack(targets), [1, 0], name="targets_"),
    tf.transpose(tf.stack(self.target_weights), [1, 0], name="weights_"),
    softmax_loss_function=softmax_loss_function)
ipdb> tf.__version__
'1.1.0-rc0'
python: 2.7
It's a problem with tf.contrib.seq2seq.loss.sequence_loss, for sure.
If you use dynamic RNNs and don't unroll your BPTT manually, you can use a much simpler loss function.
What I did is basically:
loss = tf.reduce_sum(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=self.answers,
        logits=presoftmax
    )
) / self.batch_sz
I know, it's not purely scientific. You'll need to shape it for your task. It's just a hint.
I guess you are using GreedyEmbeddingHelper? During training, you should use TF's TrainingHelper. The output dimension should match your target dimension because, at every time step, the target is used as the input.
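A minimal sketch of a training-time decoder built with TrainingHelper (TF 1.1-era contrib API); decoder_cell, decoder_inputs (the embedded target sequences), target_lengths, encoder_state and output_layer are assumed to exist in your model code:

from tensorflow.contrib import seq2seq

helper = seq2seq.TrainingHelper(
    inputs=decoder_inputs,           # [batch, max_target_len, embed_dim]
    sequence_length=target_lengths)  # [batch]
my_decoder = seq2seq.BasicDecoder(
    cell=decoder_cell,
    helper=helper,
    initial_state=encoder_state,
    output_layer=output_layer)       # projects rnn_output to vocabulary logits
decoder_outputs, decoder_state = seq2seq.dynamic_decode(my_decoder)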