I am trying to adapt the transformer code from https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/transformer.ipynb and use it for action recognition, but I keep getting Resource exhausted: OOM when allocating. I have an RTX Titan with 24 GB, so I find it very strange to be running into this kind of problem. My dataset is composed of 1000 actions with a variable number of frames (N), where each frame contains 84 float32 points (x, y). I combine N and the points to form a fairly big 1D tensor for every action. Only one action has a tensor of length 40K; all the others are around 10K, 20K, etc.
My batch = 1, num_layers = 2, d_model = 32, dff = 64, num_heads = 2.
A couple of batches are able to run before giving me this error.
EPOCH: 0
tf.Tensor([[336.655 324.18 310.146 ... 252.652 260.521 260.083]],
shape=(1, 15960), dtype=float32) tf.Tensor([[783 499 572 19 784]],
shape=(1, 5), dtype=int64)
Epoch 1 Batch 0 Loss 6.3220 Accuracy 0.2500
tf.Tensor([[323.237 320.201 310.713 ... 223.767 226.396 226.396]],
shape=(1, 13020), dtype=float32) tf.Tensor([[783 42 50 784]],
shape=(1, 4), dtype=int64)
Epoch 1 Batch 1 Loss 6.2927 Accuracy 0.2917
tf.Tensor([[343.387 331.561 316.581 ... 263.883 260.453 255.308]],
shape=(1, 12096), dtype=float32) tf.Tensor([[783 46 784]], shape=(1,
3), dtype=int64)
Epoch 1 Batch 2 Loss 6.1787 Accuracy 0.3611
tf.Tensor([[320.014 317.322 306.94 ... 219.537 220.311 221.472]],
shape=(1, 10332), dtype=float32) tf.Tensor([[783 334 784]], shape=(1,
3), dtype=int64)
Epoch 1 Batch 3 Loss 6.1479 Accuracy 0.3958
tf.Tensor([[224.648 218.128 208.176 ... 188.814 191.243 196.101]],
shape=(1, 27300), dtype=float32) tf.Tensor([[783 350 784]], shape=(1,
3), dtype=int64)
Below is the error I am getting:
2 root error(s) found. (0) Resource exhausted: OOM when allocating
tensor with shape[1,2,31332,31332] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax
(defined at :32) ]] Hint: If you want
to see a list of allocated tensors when OOM happens, add
report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
[[gradient_tape/transformer_1/encoder_2/embedding_4/embedding_lookup/Reshape/_280]]
Hint: If you want to see a list of allocated tensors when OOM happens,
add report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
(1) Resource exhausted: OOM when allocating tensor with
shape[1,2,31332,31332] and type float on
/job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax
(defined at :32) ]] Hint: If you want
to see a list of allocated tensors when OOM happens, add
report_tensor_allocations_upon_oom to RunOptions for current
allocation info.
0 successful operations. 0 derived errors ignored.
[Op:__inference_train_step_20907]
Errors may have originated from an input operation. Input Source
operations connected to node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax:
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/add
(defined at :28)
Input Source operations connected to node
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/Softmax:
transformer_1/encoder_2/encoder_layer_5/multi_head_attention_16/add
(defined at :28)
Function call stack: train_step -> train_step
::UPDATED PROBLEM::
So I reduced the shape of my input tensors. I was able to run without resource exhaustion, but now I am running into another problem. I keep getting:
2 root error(s) found.
(0) Invalid argument: indices[0,1923] = -1
is not in [0, 12936) [[node
transformer_1/encoder_2/embedding_4/embedding_lookup (defined at
:24) ]] (1) Invalid argument:
indices[0,1923] = -1 is not in [0, 12936) [[node
transformer_1/encoder_2/embedding_4/embedding_lookup (defined at
:24) ]]
[[transformer_1/encoder_2/embedding_4/embedding_lookup/_24]] 0
successful operations. 0 derived errors ignored.
[Op:__inference_train_step_17044]
Function call stack: train_step -> train_step
If my input tensors have fewer than 3000 elements I can run it successfully, but anything larger gives the above error. Has anyone run into this kind of problem? I have no idea what the error means or how to fix it :(
Any help is again appreciated.
Your sequence is too long: in a transformer every element attends to every other element, so the attention matrix grows quadratically with the sequence length. As a result you have to allocate a huge attention tensor (~8 GB just for the scores of one encoder layer). The transformer needs a lot of memory to run, and it looks like even 24 GB is not enough here. Try a shorter sequence length.
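A quick back-of-the-envelope calculation with the shape from the OOM trace (batch 1, 2 heads, sequence length 31332, float32 = 4 bytes per element) shows the size of a single attention-score tensor; the backward pass and the other layers need several tensors of this size at the same time:

batch, num_heads, seq_len = 1, 2, 31332
size_gb = batch * num_heads * seq_len * seq_len * 4 / 1e9
print(size_gb)  # ~7.9 GB for one attention-score tensor of shape [1, 2, 31332, 31332]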
Related
Thanks for reading.
I trained an LSTM predictor with a fixed input shape of (None, 5, 2), and when I test the predictor with a smaller shape of (None, 1, 2) I get the warning:
WARNING:tensorflow:Model was constructed with shape (None, 5, 2) for input Tensor("input_1_1:0", shape=(None, 5, 2), dtype=float32), but it was called on an input with incompatible shape (None, 1, 2).
However, the results are fine.
I just wonder what TensorFlow actually does when this happens. Does it automatically pad with zeros so that the dimensions match?
Again, thanks for reading, and I'm looking forward to an answer.
Tensor computations are executed as a TensorFlow graph - see https://www.tensorflow.org/guide/intro_to_graphs. Normally graph execution is faster.
The second (time) dimension of the LSTM input is dynamic. In such cases Keras has to rebuild the graph every time the input shape changes, which is slow. If your input shape changes frequently, graph execution can end up slower than eager execution; that is why Keras issues the warning.
Keras does not pad your data.
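For what it's worth, here is a minimal sketch of the situation described in the question, using a toy model: the second call triggers exactly this warning, runs the LSTM over the shorter window without any padding, and retraces the graph for the new shape.

import numpy as np
import tensorflow as tf

# Toy model built for a fixed window of 5 timesteps with 2 features.
inputs = tf.keras.Input(shape=(5, 2))
outputs = tf.keras.layers.LSTM(8)(inputs)
model = tf.keras.Model(inputs, outputs)

model.predict(np.random.rand(3, 5, 2))  # matches the declared shape, no warning
model.predict(np.random.rand(3, 1, 2))  # warns about the incompatible shape, no padding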
The new tf.keras.preprocessing.timeseries_dataset_from_array function is used to create sliding minibatch windows over sequential data, for example for tasks involving RNNs.
According to the docs it returns a minibatch of inputs and targets. However, the target minibatch this function returns does not have a sequence_length (timesteps) dimension. For example:
from tensorflow.keras.preprocessing import timeseries_dataset_from_array

# tokens: array of word indices, targets: one label per position (defined elsewhere)
data = timeseries_dataset_from_array(
    data=tokens,
    targets=targets,
    sequence_length=25,
    batch_size=32,
)

for minibatch in data:
    inputs, targets = minibatch
    assert inputs.shape[1] == targets.shape[1]  # fails: targets have no timestep dimension
The inputs have shape [32, 25, 1] (in the case where you just have word indices there), while the targets confusingly have shape [32, 1].
So, my question is: how am I supposed to map an input tensor with a window of 25 timesteps to a target tensor with no timestep window at all?
The way I always train sequence models is by feeding an input tensor of shape [32, 25, 1], which is then projected to [32, 25, 100], and then feeding a target tensor of shape [32, 25, 1] to the loss function, or, for a multi-class problem, a target tensor of shape [32, 25, num_of_classes].
That is why I am confused by the shape of the target tensor returned by timeseries_dataset_from_array and the intuition behind it.
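For contrast, here is a minimal sketch of the per-timestep setup described in the previous paragraph; the vocabulary size, embedding width, hidden size, and class count are made-up values for illustration. With return_sequences=True the model emits one prediction per timestep, so the targets keep the 25-step window as well.

import tensorflow as tf

vocab_size, embed_dim, num_classes = 10000, 100, 5  # made-up sizes

# [batch, 25] word indices -> [batch, 25, 100] embeddings -> one prediction
# per timestep of shape [batch, 25, num_classes].
inputs = tf.keras.Input(shape=(25,), dtype='int32')
x = tf.keras.layers.Embedding(vocab_size, embed_dim)(inputs)
x = tf.keras.layers.LSTM(64, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(num_classes, activation='softmax')(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')  # targets: [batch, 25]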
I am studying 1d convolution using tensorflow.
Code:
import numpy as np
import tensorflow as tf
##### raw data, input length is 24, and feature_len is 6
batch = np.ceil((np.random.rand(24, 6)*10))-5
##### filter for convolution, filter width is 3, filter input dim is 6, output dim is 18
eye_filter = tf.constant(np.eye(3*6).reshape(3,6,18).reshape(3,6,18))
##### here error happened
conv = tf.nn.conv1d(input=batch, filters=eye_filter, stride=1, padding='SAME')
Error Message:
InvalidArgumentError Traceback (most recent call
last)
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py
in _create_c_op(graph, node_def, inputs, control_inputs) 1606
try:
-> 1607 c_op = c_api.TF_FinishOperation(op_desc) 1608 except errors.InvalidArgumentError as e:
InvalidArgumentError: Shape must be rank 4 but is rank 3 for
'conv1d_1' (op: 'Conv2D') with input shapes: [24,1,6], [1,3,6,18].
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call
last) 10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py
in _create_c_op(graph, node_def, inputs, control_inputs) 1608
except errors.InvalidArgumentError as e: 1609 # Convert to
ValueError for backwards compatibility.
-> 1610 raise ValueError(str(e)) 1611 1612 return c_op
ValueError: Shape must be rank 4 but is rank 3 for 'conv1d_1' (op:
'Conv2D') with input shapes: [24,1,6], [1,3,6,18].
Why is the filter rank 4 when I reshaped it to rank 3?
Why is the op name Conv2D when I called conv1d?
How can I see the convolution result of the above two tensors (raw data and filter)?
It's expecting your input tensor to be "rank 4", meaning it has 4 dimensions, but you've technically given it a 2D array.
Technically, conv1d uses Conv2D under the hood, as you noticed, according to this API documentation:
conv1d api doc
Your input data has a length of 24 and 6 channels for the features.
The TF convolution functions can operate on a batch of inputs.
This means your data also needs an index for which element of the batch you want to select. I'm guessing from your example that you want to pass just one input. To fix this, you need to reshape your tensor to add this extra dimension with length 1.
Really, conv1d only needs your input to be rank 3, but it transparently inserts a new dimension of length 1 so the data is 2D (imagine a monitor with a resolution of 1920x1: technically 2D, but only 1 pixel high), and then passes that to conv2d.
Instead of keeping the data as a NumPy array, convert it to a tensor (tf.convert_to_tensor) and then reshape it to [Nth item (length 1)][Width (length 24)][Channel (length 6)].
Here's how I would rewrite your code:
import numpy as np
import tensorflow as tf
#####raw data, input length is 24, and feature_len is 6
batch = np.ceil((np.random.rand(24, 6)*10))-5
batch = tf.convert_to_tensor(batch, dtype=tf.float32)  # cast so input and filter dtypes match
batch = tf.reshape(batch, shape=[1, 24, 6])  # add the batch dimension (tf.reshape has no dtype argument)
##### filter for convolution, filter width is 3, filter input dim is 6, output dim is 18
eye_filter = tf.constant(np.eye(3*6).reshape(3, 6, 18), dtype=tf.float32)
##### the convolution now runs without the rank error
# I added the optional data_format parameter
conv = tf.nn.conv1d(input=batch, data_format='NWC', filters=eye_filter, stride=1, padding='SAME')
I chose that specific shape ordering based on the conv1d api doc's data_format parameter, whose default is "NWC", i.e. Nth_item, Width, Channels. conv2d uses "NHWC" by default (with "NCHW" also available). I would make sure you understand how that works so that in the future you don't get weird results from an array that's shaped in a way you didn't expect.
If you want to see the tensor output, you need to either build a graph and run it in a session, or turn on eager execution.
sess = tf.Session()
print(sess.run(conv))
sess.close()
eager execution
Generally, you would use a session for speed with large computations, and use eager execution for debugging, learning, or verifying that data is being imported correctly.
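Alternatively, here is a minimal eager-mode sketch of the same computation, assuming TF 1.x (in TF 2.x eager execution is already on by default, so the enable call is unnecessary):

import numpy as np
import tensorflow as tf

# Eager execution must be enabled right after the import, before any ops are built.
tf.enable_eager_execution()

batch = tf.constant(np.ceil(np.random.rand(1, 24, 6) * 10) - 5, dtype=tf.float32)
eye_filter = tf.constant(np.eye(3 * 6).reshape(3, 6, 18), dtype=tf.float32)

conv = tf.nn.conv1d(input=batch, data_format='NWC', filters=eye_filter, stride=1, padding='SAME')
print(conv)  # the values print directly, no session needed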
I am doing transfer learning using a pre-trained inception-resnet-v2 model. From one of the conv layers I am extracting the best activation (best quality) to calculate the predicted landmarks using OpenCV and NumPy operations. The loss function I am applying is the mean_squared_error loss. Unfortunately, when I execute this function I get an error message saying that no gradients are available for any of the variables. I have been struggling with this problem for two weeks and I don't know how to proceed. While debugging I could see that the problem occurs when the apply_gradients function gets executed internally. I have searched and tried some solutions from here, like these:
ValueError: No gradients provided for any variable in Tensorflow
selecting trainable variables to compute gradient "No variables to optimize"
Tensorflow: How to replace or modify gradient?
...
In addition, I have tried to write my own operation with gradient support, using this awesome tutorial: https://code-examples.net/en/q/253d718, because this solution wraps my Python and OpenCV code in TensorFlow. Unfortunately, the issue still remains. Tracing the path from the output of the network to the mean_squared_error function in TensorBoard, I could see that the path is available and continuous, too.
# Extracts the best predicted images from a specific activation layer
# PYTHON function: get_best_images(...) -> uses numpy and opencv
# PYTHON function: extract_landmarks(...) -> uses numpy

# Endpoints is the conv layer that gets extracted
best_predicted = tf.py_func(get_best_images, [input, end_points['Conv2d_1a_3x3']], tf.uint8)  # Gets best activation
best_predicted.set_shape(input.shape)

# Gets the predicted landmarks and processes both target and predicted for further calculation
proc_landmarks = tf.py_func(get_landmarks, [best_predicted, target_landmarks], [tf.int32, tf.int32])
proc_landmarks[0].set_shape(target_landmarks.shape)  # target landmarks
proc_landmarks[1].set_shape(target_landmarks.shape)  # predicted landmarks

# --> HERE COMES THE COMPUTATION TO PROCESS THE TARGET AND PREDICTED LANDMARKS

# Flattens and reshapes the tensors to 1D (68, 1)
target_flatten = tf.reshape(target_result[0], [-1])
target_flatten = tf.reshape(target_flatten, [68, 1])
predicted_flatten = tf.reshape(predicted_result[1], [-1])
predicted_flatten = tf.reshape(predicted_flatten, [68, 1])
edit_target_landmarks = tf.cast(target_flatten, dtype=tf.float32)
edit_predicted_landmarks = tf.cast(predicted_flatten, dtype=tf.float32)

# Calculating the loss
mse_loss = tf.losses.mean_squared_error(labels=edit_target_landmarks,
                                        predictions=edit_predicted_landmarks)
optimizer = tf.train.AdamOptimizer(learning_rate=0.001,
                                   name='ADAM_OPT').minimize(mse_loss)  # <-- here does the error occur
The error message is this one (for brevity, only some of the variables are listed):
ValueError: No gradients provided for any variable, check your graph for ops that do not support gradients, between variables ["<tf.Variable 'InceptionResnetV2/Conv2d_1a_3x3/weights:0' shape=(3, 3, 3, 32) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_1a_3x3/BatchNorm/beta:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2a_3x3/weights:0' shape=(3, 3, 32, 32) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2a_3x3/BatchNorm/beta:0' shape=(32,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2b_3x3/weights:0' shape=(3, 3, 32, 64) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_2b_3x3/BatchNorm/beta:0' shape=(64,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_3b_1x1/weights:0' shape=(1, 1, 64, 80) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_3b_1x1/BatchNorm/beta:0' shape=(80,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_4a_3x3/weights:0' shape=(3, 3, 80, 192) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Conv2d_4a_3x3/BatchNorm/beta:0' shape=(192,) dtype=float32_ref>", "<tf.Variable 'InceptionResnetV2/Mixed_5b/Branch_0/Conv2d_1x1/weights:0' shape=(1, 1, 192, 96) dtype=float32_ref>", "
EDIT:
I have managed to compute the gradients for the first two variables of the train list using this guide: Override Tensorflow Backward-Propagation. It turned out I had forgotten the third parameter (called the d parameter in the guide) in the forward and backward propagation functions, which in my case is the conv layer output of the net. Nevertheless, only the first two gradients get computed and all the others are missing. Do I have to compute and return a gradient in the backpropagation function for every trainable variable? If I understand correctly, in the backpropagation function we compute the derivatives with respect to the op's inputs, which in my case are two variables (target and predicted) and the conv layer output (i.e. return grad * op.inputs[0], grad * op.inputs[1], grad * op.inputs[2]). I thought that the overall computation for all trainable variables happens after defining the custom gradient computation, when opt.compute_gradients is applied with the variable list as its second parameter. Am I right or wrong?
I have posted part of the TensorBoard output for the mean_squared_error op. The image shows an additional loss function which I had left out to simplify my problem; that loss function works fine. The arrow from the mean_squared_error function to the gradient computation is missing because of this issue. I hope this gives a better overview.
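For reference, a hypothetical, minimal sketch of the py_func-with-gradient-override pattern from the linked tutorial, showing that the registered gradient function has to return exactly one tensor per op input, otherwise the chain back to the upstream variables is cut (my_py_func, its subtraction body, and the gradient are illustrative placeholders, not the actual landmark code):

import tensorflow as tf

def my_py_func(x, y, name='my_py_op'):
    """Wrap a Python function in tf.py_func with a custom gradient."""
    grad_name = 'PyFuncGrad_' + name  # must be unique per registration

    @tf.RegisterGradient(grad_name)
    def _grad(op, grad):
        # Return one gradient per input of the op; the forward function here
        # is a - b, so d/da = 1 and d/db = -1.
        return grad, -grad

    g = tf.get_default_graph()
    with g.gradient_override_map({'PyFunc': grad_name}):
        return tf.py_func(lambda a, b: a - b, [x, y], tf.float32,
                          stateful=False, name=name)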
I'm trying to use batch normalization in a conv2d_transpose as follows:
h1 = tf.layers.conv2d_transpose(inputs, 64, 4, 2, padding='SAME',
kernel_initializer=tf.variance_scaling_initializer,
bias_initializer=tf.ones_initializer,
activity_regularizer=tf.layers.batch_normalization,
)
h2 = tf.layers.conv2d_transpose(h1, 3, 4, 2, padding='SAME',
kernel_initializer=tf.variance_scaling_initializer,
bias_initializer=tf.ones_initializer,
activity_regularizer=tf.layers.batch_normalization,
)
And I am receiving the following error:
ValueError: Dimension 1 in both shapes must be equal, but are 32 and 64
From merging shape 2 with other shapes. for 'tower0/AddN' (op: 'AddN') with input shapes: [?,32,32,64], [?,64,64,3].
I've seen that other people have had this error in Keras because of the difference in dimension ordering between TensorFlow and Theano. However, I'm using pure TensorFlow, all of my variables are in TensorFlow dimension format (batch_size, height, width, channels), and the data_format of the conv2d_transpose layer should be the default 'channels_last'. What am I missing here?
tf.layers.batch_normalization should be added as a layer, not as a regularizer. activity_regularizer is a function that takes the activity (the layer's output) and produces an extra loss term that is added to the overall loss of the whole network; for example, you might want to penalize networks that produce high activations. You can see how activity_regularizer is called on the outputs and its result added to the loss here.
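For illustration, a minimal sketch of that layer-based approach using the same two transposed convolutions from the question (inputs and the is_training flag are assumed to be defined elsewhere in your code):

h1 = tf.layers.conv2d_transpose(inputs, 64, 4, 2, padding='SAME',
                                kernel_initializer=tf.variance_scaling_initializer,
                                bias_initializer=tf.ones_initializer)
h1 = tf.layers.batch_normalization(h1, training=is_training)  # BN applied as its own layer

h2 = tf.layers.conv2d_transpose(h1, 3, 4, 2, padding='SAME',
                                kernel_initializer=tf.variance_scaling_initializer,
                                bias_initializer=tf.ones_initializer)
h2 = tf.layers.batch_normalization(h2, training=is_training)

Note that with tf.layers.batch_normalization you also need to run the update ops collected in tf.GraphKeys.UPDATE_OPS alongside your train op so that the moving statistics are updated during training.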