How to set a specific GPU in BERT? - tensorflow

ResourceExhaustedError (see above for traceback):
OOM when allocating tensor of shape [768] and type float [[node
bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Initializer/zeros
(defined at /home/zyl/souhu/bert/optimization.py:122) =
Const[_class=["loc:@bert/encoder/layer_0/attention/output/LayerNorm/beta/adam_m/Assign"],
dtype=DT_FLOAT, value=Tensor, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
How do I set GPU 1 (or another GPU) to run BERT?

The easiest way to choose which GPUs are used is to set the CUDA_VISIBLE_DEVICES environment variable. The device will still appear as GPU:0 to TensorFlow, but it will be a physically different card.
If you are using BERT from within Python (which is a rather painful way), you can wrap the code that creates the BERT graph in a device block:
with tf.device('/device:GPU:1'):
    model = modeling.BertModel(...)
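
A minimal sketch of the environment-variable approach (the script name below is just a placeholder); the variable has to be set before TensorFlow initializes the GPUs:

import os

os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # hide every GPU except physical device 1

import tensorflow as tf  # import (and any GPU use) must happen after the variable is set

The same thing works from the shell, e.g. CUDA_VISIBLE_DEVICES=1 python your_bert_script.py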

Related

Tensorflow not using multiple GPUs - getting OOM

I'm running into OOM on a multi-gpu machine, because TF 2.3 seems to be allocating a tensor using only one GPU.
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But tensorflow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?
The direct answer is yes, you do need to do more to get TF to actually use multiple GPUs. You should refer to this guide, but the tl;dr is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
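As a minimal sketch (the layers, shapes, and dataset here are placeholder assumptions, not taken from the question), everything that creates variables, including the model and the optimizer, has to be built inside the strategy scope:

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
with mirrored_strategy.scope():
    # Build and compile the model inside the scope so its variables are mirrored on every GPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(64, 128, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(train_dataset, epochs=5)  # each global batch is split across the GPUs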
But in your case, something else is happening. While this one tensor may be triggering the OOM, the failure is likely because several previous large tensors were already allocated.
The first dimension, your batch size, is 20532, which is really big. Since its factorization is 2**2 × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3x64x128, which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print model.summary() and review the sizes of the tensors coming out of each layer. You may also need to look at your batching.
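For example (hypothetical names; adapt to your own model and input pipeline):

model.summary()  # check the "Output Shape" column of every layer for an unexpected blow-up

# Batch explicitly so the leading dimension is your intended batch size,
# not the whole dataset fed in one step.
train_dataset = train_dataset.batch(32)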

OOM with tensorflow

I'm facing an OOM error while training my tensorflow model; the structure is as follows:
tf.contrib.layers.embed_sequence initialized with GoogleNewsVector
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #forward
2 * tf.nn.rnn_cell.DropoutWrapper(tf.nn.rnn_cell.LSTMCell) #backward
tf.nn.bidirectional_dynamic_rnn wrapping the above layers
tf.layers.dense as an output layer
I tried reducing the batch size down to as low as 64; my input data is padded to 1500, and my vocab size is 8938.
The cluster I'm using is very powerful (https://wiki.calculquebec.ca/w/Helios/en). I'm using two nodes with 8 GPUs each and still getting this error:
2019-02-23 02:55:16.366766: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at reverse_op.cc:270 : Resource exhausted: OOM when
allocating tensor with shape[2000,800,300] and type float on
/job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
I'm using the Estimator API with MirroredStrategy and still no luck. Is there a way to ask tensorflow to run the training on the GPUs but keep the tensors stored in the main machine's memory? Any other suggestions are welcome.
Running a particular operation (e.g. some tensor multiplication during training) on a GPU requires having those tensors stored on the GPU.
You might want to use TensorBoard or something like that to see which operations require the most memory in your computation graph. In particular, it's possible that the first link between the embeddings and the LSTM is the culprit, and you'd need to narrow that down somehow.
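One common workaround, sketched below with hypothetical names and sizes rather than the asker's actual code, is to pin the large embedding table and its lookup to the CPU so that only the much smaller per-batch tensors live on the GPU:

import tensorflow as tf

input_ids = tf.placeholder(tf.int32, [None, 1500])  # padded token ids (hypothetical)

with tf.device('/cpu:0'):
    # The full [vocab_size, embed_dim] table stays in host memory.
    embedding_table = tf.get_variable('embeddings', [8938, 300])
    embedded_inputs = tf.nn.embedding_lookup(embedding_table, input_ids)

with tf.device('/gpu:0'):
    cell_fw = tf.nn.rnn_cell.LSTMCell(256)
    cell_bw = tf.nn.rnn_cell.LSTMCell(256)
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, embedded_inputs, dtype=tf.float32)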

Mask R-CNN for TPU on Google Colab

We are trying to build an image segmentation deep learning model using Google Colab TPU. Our model is Mask R-CNN.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
import tensorflow as tf
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model.keras_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
However, I am running into issues while converting our Mask R-CNN model to a TPU model, as pasted below.
ValueError:
Layer <keras.engine.topology.InputLayer object at 0x7f58574f1940> has a
variable shape in a non-batch dimension. TPU models must
have constant shapes for all operations.
You may have to specify `input_length` for RNN/TimeDistributed layers.
Layer: <keras.engine.topology.InputLayer object at 0x7f58574f1940>
Input shape: (None, None, None, 3)
Output shape: (None, None, None, 3)
Appreciate any help.
Google recently released a tutorial on getting Mask R-CNN running on their TPUs. For this, they use an experimental Mask R-CNN model from Google's TPU GitHub repository (under models/experimental/mask_rcnn). Looking through the code, it appears they define the model with a fixed input size to overcome the issue you are seeing.
See below for more explanation:
As @aman2930 points out, the shape of your input tensor is not static. This won't work because TensorFlow compiles models with XLA to run on a TPU, and XLA must have all tensor shapes defined at compile time. In the link above, the documentation specifically calls this out:
Static shapes
During regular usage TensorFlow attempts to determine the shapes of each tf.Tensor during graph construction. During execution any unknown shape dimensions are determined dynamically, see Tensor Shapes for more details.
To run on Cloud TPUs TensorFlow models are compiled using XLA. XLA uses a similar system for determining shapes at compile time. XLA requires that all tensor dimensions be statically defined at compile time. All shapes must evaluate to a constant, and not depend on external data, or stateful operations like variables or a random number generator.
That said, further down the document they mention that the input function runs on the CPU, so it isn't limited by static XLA sizes. They point to batch size being the issue, not image size:
Static shapes and batch size
The input pipeline generated by your input_fn is run on CPU. So it is mostly free from the strict static shape requirements imposed by the XLA/TPU environment. The one requirement is that the batches of data fed from your input pipeline to the TPU have a static shape, as determined by the standard TensorFlow shape inference algorithm. Intermediate tensors are free to have dynamic shapes. If shape inference has failed, but the shape is known, it is possible to impose the correct shape using tf.set_shape().
So you could fix this by reformulating your model to have a fixed batch size, or by using tf.contrib.data.batch_and_drop_remainder as they suggest.
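A rough sketch of the second option (the dataset and batch size are placeholders; in newer TensorFlow versions the same effect comes from drop_remainder=True on Dataset.batch):

import tensorflow as tf

BATCH_SIZE = 8  # placeholder value
# dataset: your tf.data.Dataset of (image, mask) examples

# TF 1.x contrib form referenced by the docs:
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(BATCH_SIZE))

# Equivalent in newer TF versions:
# dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)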
Could you please share the input data function? It is hard to tell the exact issue, but it seems that the shape of the tensor representing the input sample is not static.

Can tensorflow summary ops be assigned to a GPU?

Here is part of my code.
with tf.Graph().as_default(), tf.device('/cpu:0'):
    global_step = tf.get_variable(
        'global_step',
        [],
        initializer=tf.constant_initializer(0))

    writer = tf.summary.FileWriter(logs_path, graph=tf.get_default_graph())

    with tf.device('/gpu:0'):
        tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
        summary_op = tf.summary.merge_all()
When I run it, I get the following error:
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'learning_rate': Could not satisfy explicit device specification '/device:GPU:0' because no
supported kernel for GPU devices is available.
[[Node: learning_rate = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](learning_rate/tags, learning_rate/values)]]
If I move these two ops into the tf.device("/cpu:0") device scope, it works:
tf.summary.scalar('learning_rate', INITIAL_LEARNING_RATE)
summary_op = tf.summary.merge_all()
I googled it and there are many suggestions about using allow_soft_placement=True, but I think that solution basically just changes the device scope automatically. So my questions are:
Why can't these two ops be assigned to a GPU? Are there any documents I can look at to figure out which ops can or cannot be assigned to a GPU?
Any suggestion is welcome.
You can't assign a summary operation to a GPU because it is meaningless.
In short, a GPU executes parallel operations. A summary is nothing but a file to which you append new lines every time you write to it. It's a sequential operation that has nothing in common with the kind of operations GPUs are built for.
Your error says it all:
Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
That operation (in the tensorflow version you're using) has no GPU implementation and thus must be sent to a CPU device.
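If you would rather not hand-place every op, the allow_soft_placement option mentioned in the question tells TensorFlow to fall back to the CPU whenever an op has no GPU kernel. A TF 1.x sketch, reusing summary_op and writer from the snippet above:

import tensorflow as tf

config = tf.ConfigProto(
    allow_soft_placement=True,   # fall back to CPU for ops without a GPU kernel
    log_device_placement=True)   # optional: log the device chosen for each op

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    writer.add_summary(sess.run(summary_op), global_step=0)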

RNN model running out of memory in TensorFlow

I implemented a Sequence to Sequence model using the rnn.rnn helper in TensorFlow.
with tf.variable_scope("rnn") as scope, tf.device("/gpu:0"):
    cell = tf.nn.rnn_cell.BasicLSTMCell(4096)
    lstm = tf.nn.rnn_cell.MultiRNNCell([cell] * 2)

    _, cell = rnn.rnn(lstm, input_vectors, dtype=tf.float32)

    tf.get_variable_scope().reuse_variables()

    lstm_outputs, _ = rnn.rnn(lstm, output_vectors, initial_state=cell)
The model is running out of memory on a Titan X with 16 GB of memory while allocating gradients for the LSTM cells:
W tensorflow/core/kernels/matmul_op.cc:158] Resource exhausted: OOM when allocating tensor with shape[8192,16384]
W tensorflow/core/common_runtime/executor.cc:1102] 0x2b42f00 Compute status: Resource exhausted: OOM when allocating tensor with shape[8192,16384]
[[Node: gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/concat, gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/add_grad/tuple/control_dependency)]]
If I reduce the length of the input and output sequences to 4 or less the model runs without a problem.
This indicates to me that TF is trying to allocate the gradients for all time steps at the same time. Is there a way of avoiding this?
The function tf.gradients, as well as the minimize method of the optimizers, allows you to set a parameter called aggregation_method. The default value is ADD_N. This method constructs the graph in such a way that all gradients need to be computed at the same time.
There are two other undocumented methods called tf.AggregationMethod.EXPERIMENTAL_TREE and tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N, which do not have this requirement.
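A minimal sketch of passing it through minimize (the optimizer, learning rate, and loss below are placeholders, not taken from the question):

import tensorflow as tf

# loss is assumed to be your scalar training loss tensor.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
train_op = optimizer.minimize(
    loss,
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)

# Or, when calling tf.gradients directly:
# grads = tf.gradients(loss, tf.trainable_variables(),
#                      aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)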