This is my error:
OOM when allocating tensor of shape [7,7,512,4096] and type float
[[Node: W6/Adam/Initializer/zeros = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,512,4096] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]
In the traceback I can see that it's caused by this line:
sess.run(tf.global_variables_initializer())
All parameters combined use around 1.5 GB of memory, and I have 4 GB of GPU memory available.
I have already tried the following, without success:
config.gpu_options.allocator_type = 'BFC'
config.gpu_options.per_process_gpu_memory_fraction = 0.40
config.gpu_options.allow_growth = True
How can I fix it?
EDIT:
How did I calculate the amount of used memory?
var_sizes = [np.product(list(map(int, v.shape))) * v.dtype.size
             for v in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)]
print(sum(var_sizes) / (1024 ** 2), 'MB')
In your calculation you compute the amount of memory necessary to hold your variables. However, this is only a fraction of the memory you need. In particular, you are missing:
The neuron outputs (i.e. the features).
The gradient of your cost function with respect to your model parameters and the features.
TensorFlow does try to optimize memory where possible, but your calculation only gives a ballpark lower bound, so I would not be surprised if you need more than 4 GB in total after all.
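As a back-of-the-envelope extension of your calculation (this is only a rough sketch, assuming the Adam optimizer, which keeps two extra slot variables per trainable parameter, and still ignoring the activations):

import numpy as np
import tensorflow as tf

# Bytes held by the trainable parameters themselves.
param_bytes = sum(
    np.prod(list(map(int, v.shape))) * v.dtype.size
    for v in tf.trainable_variables())

# Rough lower bound for training with Adam:
# parameters + gradients + two Adam moment slots = 4x the parameter bytes.
# The activations kept around for backprop come on top of this and depend
# on the batch size and architecture, so the real footprint is larger still.
print('parameters: %.1f MB' % (param_bytes / 2 ** 20))
print('rough training lower bound: %.1f MB' % (4 * param_bytes / 2 ** 20))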
I'm running into OOM on a multi-GPU machine because TF 2.3 seems to be allocating a tensor using only one GPU.
tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at conv_ops.cc:539 :
Resource exhausted: OOM when allocating tensor with shape[20532,64,48,32]
and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc.
But TensorFlow does recognize multiple GPUs when I run my code:
Adding visible gpu devices: 0, 1, 2
Is there anything else I need to do to have TF use all GPUs?
The direct answer is yes, you do need to do more to get TF to use multiple GPUs. You should refer to this guide, but the TL;DR is:
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    ...
https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_tfkerasmodelfit
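For context, a slightly fuller sketch of that pattern with a Keras model (the model, layer shapes, and dataset here are placeholders for illustration, not taken from the question):

import tensorflow as tf

mirrored_strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', mirrored_strategy.num_replicas_in_sync)

# Anything that creates variables (the model and the optimizer) must be
# built inside the strategy scope so the variables are mirrored on every GPU.
with mirrored_strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, activation='relu', input_shape=(64, 48, 32)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(...) then splits each global batch across the visible GPUs,
# so the per-GPU batch is global_batch_size / num_replicas_in_sync.
# model.fit(train_dataset, epochs=10)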
But in your case, something else is happening. While this one tensor may be triggering the OOM, it's likely because a few previous large tensors were allocated.
The first dimension, your batch size, is 20532, which is really big. Since that factors as 2² × 3 × 29 × 59, I'm going to guess you are working with CHW format and your source image was 3×64×128, which got trimmed after a few convolutions. I'd suspect an inadvertent broadcast. Print model.summary() and review the sizes of the tensors coming out of each layer. You may also need to look at your batching.
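For the layer-by-layer check, something along these lines (assuming a built Keras model object named model) makes an unexpected blow-up in the leading dimension easy to spot:

# Per-layer output shapes and parameter counts.
model.summary()

# Or walk the layers programmatically and look for the one whose
# output shape is much larger than expected.
for layer in model.layers:
    print(layer.name, layer.output_shape)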
I am using GloVe pre-trained embeddings to train my own network. I use
self.embedding = tf.get_variable(
    name="embedding", shape=self.id2vec_table.shape,
    initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
and
tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
to initialize and look up the embeddings. However, when I train the network, I get the following error (the message is long, so I include only the parts I believe are most important):
Sum Total of in-use chunks: 3.85GiB
Limit:        11281927373
InUse:         4131524096
MaxInUse:      6826330624
NumAllocs:          47061
MaxAllocSize:  2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when
allocating tensor with shape[4800,400001] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
However, according to those stats, my Tesla K80 has about 11 GB of memory, of which only 40-70% (around 4-7 GB) is in use. How can my GPU be out of memory when it uses at most 70% of the total? I just cannot understand the inner mechanism of how this works.
I have also tried the methods from other posts, such as
https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu
and limiting my batch size to 16, as well as config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
Any help here?
I have a TensorFlow model which is a recurrent neural network using long short-term memory (LSTM). The state size is 3000, each time step has 300 inputs, there are about 500 time steps, and there is 1 output per time step. I am training a sequence-to-sequence model.
It runs fine for inputs with fewer than 500 time steps, but somewhere around 500 time steps it crashes with the following out-of-memory error:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[20375,20375]
[[Node: gradients/mean_squared_error/Mul_grad/mul_1 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](mean_squared_error/Square, gradients/mean_squared_error/Sum_grad/Tile)]]
[[Node: gradients/MatMul_grad/tuple/control_dependency_1/_225 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_5086_gradients/MatMul_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
And this is running on a GPU with 12 GB of memory.
I have tried running it on my laptop CPU, and it seems to use very little memory (about 1 to 2 GB), but it's so slow that it never got to 500 time steps. I'm working on some changes that will make it skip ahead to 500 time steps to see how much memory it uses when not running on a GPU.
My question is: where could TensorFlow possibly want to allocate a tensor of shape [20375, 20375]? It seems to be related to tf.losses.mean_squared_error, but that doesn't seem like an operation that should require such an exorbitant amount of memory.
I have tried reducing the batch size, but that just pushes the failure point up to a few more time steps, and I'll need up to a few thousand time steps, so this doesn't seem like a good long-term solution. I'd prefer to get to the root of the problem.
Here is the relevant code, from building the RNN through the mean squared error:
initial_state_tuple = tf.contrib.rnn.LSTMStateTuple(initial_state, initial_hidden_state)

# Create the actual RNN
with tf.variable_scope(VARIABLE_SCOPE, reuse=None):
    cell = tf.contrib.rnn.BasicLSTMCell(STATE_SIZE)
    rnn_outputs, finalstate = tf.nn.dynamic_rnn(cell=cell, inputs=networkinput,
                                                initial_state=initial_state_tuple)

with tf.variable_scope(VARIABLE_SCOPE, reuse=True):
    weights = tf.get_variable(name=WEIGHTS_NAME, shape=[STATE_SIZE, 1], dtype=tf.float32)
    biases = tf.get_variable(name=BIASES_NAME, shape=[1], dtype=tf.float32)

# Build the output layers
rnn_outputs_reshaped = tf.reshape(rnn_outputs, [-1, STATE_SIZE])
network_outputs = tf.sigmoid(tf.matmul(rnn_outputs_reshaped, weights) + biases)

expected_outputs_reshaped = tf.reshape(expected_outputs, [-1, 1])

# The loss mask cancels out the inputs that are padding characters, since not all
# inputs have the same number of time steps
loss_mask_reshaped = tf.reshape(loss_mask, shape=[-1])

expected_outputs_reshaped = loss_mask_reshaped * expected_outputs_reshaped
network_outputs = loss_mask_reshaped * network_outputs

loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
If you want all of the code, it can be found here. The relevant functions are buildtower() and buildgraph(). The constants NUM_GPUS and BATCH_SIZE are set to appropriate values when running on the machine with the GPUs.
Update: I replaced the line
loss = tf.losses.mean_squared_error(labels=expected_outputs_reshaped, predictions=network_outputs)
with
error_squared = tf.pow(expected_outputs_reshaped - network_outputs, 2)
loss = tf.reduce_mean(error_squared)
and the same error happened. I reduced the state size to 30 and the batch size to 5, and the error still happened, although it did make it up to about 3000 time steps.
Update: After doing some research, I have found that, when training an RNN with a large number of time steps, truncated backpropagation is often used. This leads me to believe that backpropagation through a large number of time steps inherently takes a lot of memory, and my issue is not that I've constructed my graph wrong, but that I have a fundamental misunderstanding of the resource requirements of gradient calculations. To this end, I am working on changing my code to use truncated backpropagation. I will report back with results.
This project is my first experience with machine learning and TensorFlow, and after doing some research, it seems I had some fundamental misunderstandings.
I had thought that memory usage would scale linearly with the number of time steps in my data. Because every other dimension of my model (batch size, state size) was small, I expected I could get to quite a few time steps before running out of memory. However, it seems that the memory used to compute the gradients grows much faster than linearly with the number of time steps, so no matter how small I made the state size and batch size, the large number of time steps eventually exhausted all my memory.
To deal with this, I am using truncated backpropagation, in which each batch is broken up into chunks of some fixed number of time steps. This is not perfect, because it means that errors can only be propagated back at most that many time steps. However, based on what I've found online, it seems to work well enough, and there aren't many other ways to get around the memory usage issue.
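A minimal sketch of such a chunking loop in graph-mode TensorFlow (all the placeholder and op names here, such as inputs_ph, targets_ph, initial_state_ph, final_state, and train_op, are made up for illustration and are not the names from my code):

TRUNCATED_STEPS = 100  # number of time steps to backpropagate through

state = initial_zero_state  # numpy values matching the LSTM state structure
for start in range(0, total_steps, TRUNCATED_STEPS):
    chunk_inputs = inputs[:, start:start + TRUNCATED_STEPS]
    chunk_targets = targets[:, start:start + TRUNCATED_STEPS]
    # The state is passed between chunks as a plain value, so the graph for
    # one chunk never sees the ops of the previous one and backprop stops
    # at the chunk boundary.
    state, _ = sess.run(
        [final_state, train_op],
        feed_dict={inputs_ph: chunk_inputs,
                   targets_ph: chunk_targets,
                   initial_state_ph: state})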
As I said before, this is all my first experience with machine learning, so if anything in here is blatantly wrong, please tell me.
How do you convert a TensorFlow graph from using float32 to float16? Currently there are graph optimizations for quantization and for conversion to eight-bit integers.
Trying to load float32 weights into a float16 graph fails with:
DataLossError (see above for traceback): Invalid size in bundle entry: key model/conv5_1/biases; stored size 1536; expected size 768
[[Node: save/RestoreV2_16 = RestoreV2[dtypes=[DT_HALF], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_16/tensor_names, save/RestoreV2_16/shape_and_slices)]]
[[Node: save/RestoreV2_3/_39 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_107_save/RestoreV2_3", tensor_type=DT_HALF, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
My solution is definitely not the best and probably not the most straightforward, but since nobody else has posted anything:
What I did was train the network in full precision and save the weights in a checkpoint. Then I built a copy of the network with the dtype of all desired variables set to tf.float16 and with all the training nodes removed. Finally, I loaded and cast the variables the following way:
previous_variables = [
    var_name for var_name, _
    in tf.contrib.framework.list_variables('path-to-checkpoint-file')]
# print(previous_variables)
sess.run(tf.global_variables_initializer())
restore_map = {}
for variable in tf.global_variables():
    if variable.op.name in previous_variables:
        var = tf.contrib.framework.load_variable(
            'path-to-checkpoint-file', variable.op.name)
        if var.dtype == np.float32:
            tf.add_to_collection('assignOps', variable.assign(
                tf.cast(var, tf.float16)))
        else:
            tf.add_to_collection('assignOps', variable.assign(var))
sess.run(tf.get_collection('assignOps'))
This obviously has issues if there are float32 tensors that you don't want to convert, which luckily I don't have, since I want to convert all my variables to float16. If you do have such variables, you could filter further with additional if statements, for example as sketched below. I hope this answers your question.
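One possible extra filter could look like this (the 'keep_fp32' scope name is made up purely for illustration):

for variable in tf.global_variables():
    if variable.op.name not in previous_variables:
        continue
    var = tf.contrib.framework.load_variable(
        'path-to-checkpoint-file', variable.op.name)
    # Made-up rule: variables under the 'keep_fp32' scope were left as float32
    # when building the new graph, so they are assigned without casting.
    if var.dtype == np.float32 and not variable.op.name.startswith('keep_fp32'):
        tf.add_to_collection('assignOps',
                             variable.assign(tf.cast(var, tf.float16)))
    else:
        tf.add_to_collection('assignOps', variable.assign(var))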
I had this issue, but I was loading a sub-graph which contained some variables that needed to be loaded or converted and some that did not.
Based on @Jendrik's answer, here is a function that returns the assign operations, given a dictionary that maps the saved variables to the new graph:
def assign_and_convert_halfPrecision(restore_dictionary, CHECKPOINT_PATH):
    # Iterate over the dictionary containing the variables to load
    for variable_name_old, variable_new in restore_dictionary.items():
        # Load the variable from the checkpoint
        var = tf.contrib.framework.load_variable(CHECKPOINT_PATH, variable_name_old)

        # Assign it to the new graph
        if (var.dtype == np.float32) and (variable_new.dtype == np.float16):
            # If the variable is float16 in the new graph, cast it
            tf.add_to_collection('assignOps', variable_new.assign(tf.cast(var, tf.float16)))
        else:
            # If the variable in the old graph is float16 or the new variable is float32,
            # load it directly
            tf.add_to_collection('assignOps', variable_new.assign(var))

    # Return the operations
    return tf.get_collection('assignOps')
To use it, just do:
# Create a trivial dictionary (all custom loading can be added here, like changes of scope names)
restore_dictionary = dict()
for a in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=''):
    restore_dictionary[a.name[:-2]] = a

# Create the assignment and conversion ops
assign_operation = assign_and_convert_halfPrecision(restore_dictionary, CHECKPOINT_PATH)

# Load
sess.run(assign_operation)
The loading can be controlled by modifying the dictionary, avoiding variables that should not be loaded or changing the scope of the variables to load.
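For example, a dictionary that remaps scope names and skips some variables could be built like this (the scope names 'old_model' and 'new_model' and the 'do_not_load' marker are made up for illustration):

restore_dictionary = dict()
for a in tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='new_model'):
    new_name = a.name[:-2]                 # strip the ':0' suffix
    if 'do_not_load' in new_name:          # leave these variables as initialized
        continue
    # Map the name used in the checkpoint to the variable in the new graph
    restore_dictionary[new_name.replace('new_model', 'old_model', 1)] = a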
I implemented a Sequence to Sequence model using the rnn.rnn helper in TensorFlow.
with tf.variable_scope("rnn") as scope, tf.device("/gpu:0"):
    cell = tf.nn.rnn_cell.BasicLSTMCell(4096)
    lstm = tf.nn.rnn_cell.MultiRNNCell([cell] * 2)

    _, cell = rnn.rnn(lstm, input_vectors, dtype=tf.float32)
    tf.get_variable_scope().reuse_variables()
    lstm_outputs, _ = rnn.rnn(lstm, output_vectors, initial_state=cell)
The model is running out of memory on a Titan X with 16 GB of memory while allocating gradients for the LSTM cells:
W tensorflow/core/kernels/matmul_op.cc:158] Resource exhausted: OOM when allocating tensor with shape[8192,16384]
W tensorflow/core/common_runtime/executor.cc:1102] 0x2b42f00 Compute status: Resource exhausted: OOM when allocating tensor with shape[8192,16384]
[[Node: gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/gpu:0"](rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/Linear/concat, gradients/rnn/RNN/MultiRNNCell_1/Cell0/BasicLSTMCell/add_grad/tuple/control_dependency)]]
If I reduce the length of the input and output sequences to 4 or fewer, the model runs without a problem.
This indicates to me that TF is trying to allocate the gradients for all time steps at the same time. Is there a way of avoiding this?
The function tf.gradients, as well as the minimize method of the optimizers, allows you to set a parameter called aggregation_method. The default value is ADD_N. This method constructs the graph in such a way that all gradients need to be computed at the same time.
There are two other undocumented methods called tf.AggregationMethod.EXPERIMENTAL_TREE and tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N, which do not have this requirement.
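A minimal sketch of passing one of them in with the TF1 APIs, assuming an existing loss tensor:

optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)

# Either through the optimizer ...
train_op = optimizer.minimize(
    loss,
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)

# ... or directly through tf.gradients()
grads = tf.gradients(
    loss, tf.trainable_variables(),
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)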