TF2 add report_tensor_allocations_upon_oom to RunOptions - tensorflow

I'm getting this message:
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
How do I do that in Tensorflow 2.3?
Over the past few days this turned out to be a surprisingly frustrating issue. There appears to be no working example of how to do this in TF2.

This is still a long way from an allocated tensor list, but a start for TF2:
Tensorflow 2.4.1 contains the tf.config.experimental.get_memory_usage method, which returns the current number of bytes used on the GPU. Comparing this value across different points in time can shed some light on which tensors take up VRAM. It seems to be pretty accurate.
By the way, the latest nightly build contains the tf.config.experimental.get_memory_info method instead; it seems they had a change of heart. This one reports the current as well as the peak memory used (see the short sketch after the example below).
Example code on TF 2.4.1:
import tensorflow as tf
print(tf.config.experimental.get_memory_usage("GPU:0")) # 0
tensor_1_mb = tf.zeros((1, 1024, 256), dtype=tf.float32)
print(tf.config.experimental.get_memory_usage("GPU:0")) # 1050112
tensor_2_mb = tf.zeros((2, 1024, 256), dtype=tf.float32)
print(tf.config.experimental.get_memory_usage("GPU:0")) # 3147264
tensor_1_mb = None
print(tf.config.experimental.get_memory_usage("GPU:0")) # 2098688
tensor_2_mb = None
print(tf.config.experimental.get_memory_usage("GPU:0")) # 1536
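
As a follow-up to the get_memory_info note above, here is a minimal sketch of how it can be called on builds that include it (TF 2.5+/nightly); the printed byte counts are illustrative and will differ between setups:

info = tf.config.experimental.get_memory_info("GPU:0")
print(info["current"])  # bytes currently allocated on GPU:0
print(info["peak"])     # peak bytes allocated on GPU:0 since the program started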

In order to use RunOptions you also need a RunMetadata object. In TF2, both can be found in the tf.compat.v1 package.
The following code works for keras 2.4.3 with a tensorflow 2.4.0 backend:
# Model build()
import tensorflow as tf

run_opts = tf.compat.v1.RunOptions(report_tensor_allocations_upon_oom=True)
runmeta = tf.compat.v1.RunMetadata()

keras_model.compile(optimizer=..., loss=..., options=run_opts, run_metadata=runmeta)
# Model fit()

Related

Memory leak with TPUs on GKE causing OOM/"Unavailable: Socket Closed" error

I am using preemptible v2-8 Google Cloud TPUs to perform large-scale hyperparameter optimization. I created the nodes using GKE with tensorflow 2.3 (the latest version available for Cloud TPUs). Unfortunately, I keep encountering a memory leak on the TPU nodes during the search. This memory leak ultimately seems to cause an "Unavailable: Socket Closed" error (or sometimes an OOM error), after which the TPU becomes unable to perform any additional training or evaluation, even after the code is re-deployed. The problem does not occur when I test my code on either a CPU or a GPU.
This problem occurs only on the TPU worker node, not on the controller CPU. (At one point, I had been encountering another memory leak on the CPU due to a buildup of old models and unnecessary operations on the computation graph; methods such as tf.keras.backend.clear_session() and del model resolved that leak, but it persists on the TPU.) Here is a graph of the TPU runtime memory usage (the decrease in memory at the end appears to occur after the TPU disconnects, because GKE then deletes it automatically):
Ultimately, as the used memory increases on the TPU, I get the following error:
raise_from tensorflow.python.framework.errors_impl.ResourceExhaustedError: 9 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
[[{{node cluster_train_function/_execute_4_0}}]]
0 successful operations.
0 derived errors ignored.
Occasionally, I instead get an "Unavailable: Socket Closed" error or an "Unable to destroy remote tensor handles" error.
This error typically only occurs after training several networks. I tried multiple methods suggested by other posts to fix the error, such as typecasting my data to float32, not caching my dataset into memory, using a smaller mini batch size to decrease memory consumption, and using "from_logits=True" in my cost function. I even tried using multiprocessing to perform the network training so memory would be cleared after each network evaluation, but for some reason, the Cloud TPU fails to execute any of the for loops in my code or in the training code (a problem I did not have with either a GPU or CPU, cloud or otherwise.) Larger networks seem to cause the problem to occur much more quickly than smaller networks, which suggests to me that old, unused models are still kept in memory on the TPU. Is there any way to clear the memory on the TPU or reset its state to stop this memory leak?
Here is an MVE I wrote to duplicate the problem:
import os
import gc
import sys
import random
import numpy as np
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

h = 128
w = 128
channels = 1
mini_batch_size = 256
epochs = 15
using_tpu = True

if using_tpu:
    ## Get tpu name from arguments
    tpu_name = sys.argv[1]
    tpu_name = tpu_name.replace('--tpu=', '')

    ## Initialize TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)

def create_network():
    strategy = tf.distribute.TPUStrategy(tpu)
    with strategy.scope():
        ## Create random data
        x_train = np.random.randn(1024, 128, 128, 1).astype('float32')  # astype necessary to help prevent Connect to Socket Error
        y_train = np.random.randn(1024, 50).astype('float32')
        x_test = np.random.randn(256, 128, 128, 1).astype('float32')
        y_test = np.random.randn(256, 50).astype('float32')

        model = Sequential()
        model.add(InputLayer((h, w, channels)))
        layers = 5
        ks = [np.random.choice([3, 5, 7]) for l in range(layers)]
        filters = [np.random.choice([64, 128, 256]) for l in range(layers)]
        for l in range(layers):
            model.add(
                Conv2D(kernel_size=(ks[l], ks[l]), padding='same',
                       filters=filters[l], name='conv' + str(l), activation='relu'))
        model.add(Flatten())

        # Output layer
        model.add(Dense(50))  # Don't need softmax activation because from_logits performs that operation automatically

        lr = 0.001
        opt = Adam(learning_rate=lr, decay=1e-6)
        model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, shuffle=True, verbose=1)

        ##### memory leak also occurs with dataset API:
        '''
        train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(mini_batch_size,
                                                                                     drop_remainder=True)
        model.fit(train_dataset, epochs=epochs, verbose=1, shuffle=shuffle,
                  steps_per_epoch=len(x_train) // mini_batch_size)
        '''
        #######

        y_pred = model(x_test)

    ## Attempt to clear memory
    print(gc.collect())
    del model
    tf.keras.backend.clear_session()

while True:
    create_network()
Thank you so much! Please let me know if I should include any other information.
A few things:
Your error message:
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
indicates an HBM OOM rather than a host-memory OOM. Basically, the TPU has its own high-bandwidth memory (HBM) on the chips, and in this case you've exhausted that memory. If it were a host RAM OOM, you would likely see the SocketClosed error instead, which you saw as well.
That being said, what are your options? I suggest you go with the tf.data approach but with a few modifications:
def get_dataset(is_training: bool):
    def generate_data(_):
        return tf.random.normal([128, 128, 1], dtype=tf.bfloat16)

    dataset = tf.data.Dataset.range(1)
    dataset = dataset.repeat()
    dataset = dataset.map(generate_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat().batch(mini_batch_size, drop_remainder=is_training)
    return dataset

train_dataset = get_dataset(is_training=True)
eval_dataset = get_dataset(is_training=False)
In this example we can use bfloat16, which reduces the memory footprint on HBM, but you may need to further reduce your minibatch size from 1024 to 512. Alternatively, you can go up from a v2-8 to a v3-8, which has 2x the HBM. I'm not sure whether the numpy-based method contributes to the OOMs/SocketClosed errors you see, but I don't think this approach should run into them. Of course you'll eventually use real data, and in that case you should use tf.data for optimal performance. More info here.
IIUC tf.backend.clear_session() and gc.collect() will only clear memory on your host VM, not on the TPU server.
PS: You can also use the steps_per_execution flag to further improve performance; please see here for more info. Basically this prevents execution from continuously switching between the CPU host and the TPU on every step. If you set it equal to the number of training steps in an epoch, you will get the best throughput.
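As a hedged sketch of how that flag plugs into the compile() call, reusing opt and mini_batch_size from the MVE above (on TF 2.3 the argument may instead be called experimental_steps_per_execution):

steps_per_epoch = 1024 // mini_batch_size  # training batches per epoch with the sizes above
model.compile(optimizer=opt,
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'],
              steps_per_execution=steps_per_epoch)  # dispatch a whole epoch to the TPU per call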

Set batch size of trained keras model to 1

I have a keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix the shape to a batch size of 1, as this is compulsory for me so I can convert the model to tflite with GPU support.
What worked for me was to specify batch size to the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
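For completeness, here is a hedged end-to-end sketch of that approach with a placeholder architecture (the layer sizes, input shape, and file name are illustrative, not from the original post):

import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(224, 224, 3), batch_size=1, dtype='float32', name='images')
x = layers.Conv2D(16, 3, activation='relu')(inputs)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)
model.summary()  # the first dimension now shows 1 instead of None

converter = tf.lite.TFLiteConverter.from_keras_model(model)
with open('model_b1.tflite', 'wb') as f:
    f.write(converter.convert())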
The bad news is that despite this "fix" the tfl runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0
tensorflow-lite-metadata:0.2.1
tensorflow-lite:2.6.0
tensorflow:tensorflow-lite-gpu:2.3.0
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

Loaded keras model fails to continue training, dimensions mismatch

I'm using tensorflow with keras to train a char-RNN on Google Colab. I train my model for 10 epochs and save it using model.save(), as shown in the documentation for saving models. Immediately after, I load it again just to check; when I call model.fit() on the loaded model, I get a "Dimensions must be equal" error using the exact same training set. The training data is in a tensorflow dataset organised in batches, as shown in the documentation for tf datasets. Here is a minimal working example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
X = np.random.randint(0,50,(10000))
seq_len = 150
batch_size = 20
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(20).batch(batch_size,drop_remainder=True)
def make_model(vocabulary_size, embedding_dimension, rnn_units, batch_size, stateful):
    model = Sequential()
    model.add(Embedding(vocabulary_size, embedding_dimension,
                        batch_input_shape=[batch_size, None]))
    model.add(LSTM(rnn_units, return_sequences=True, stateful=stateful))
    model.add(Dense(vocabulary_size))
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
vocab_size = 51
emb_dim = 20
rnn_units = 10
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)
model.fit(dataset,epochs=10)
model.save('/content/test_model')
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset,epochs=10)
The first training line, "model.fit()", runs fine but the last line returns the error:
ValueError: Dimensions must be equal, but are 20 and 150 for '{{node Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax, ArgMax_1)' with input shapes: [20], [20,150].
I want to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not an ideal option.
Any advice?
Thanks!
If you have saved checkpoints, you can resume training from them, even with a reduced dataset. Your neural network layers and dimensions should stay the same.
The problem is the 'accuracy' metric. For some reason, the dimensions of the predictions are mishandled when the model is loaded with this metric, as I found in this thread (see last comment). Running model.compile() on the loaded model with the same metric allows training to continue. However, it shouldn't be necessary to compile the model again; moreover, recompiling means the optimiser state is lost, as explained in this answer, so it is not very useful for resuming training.
On the other hand, using 'sparse_categorical_accuracy' from the start works just fine. I am able to load the model and continue training without having to recompile. In hindsight, this choice is more appropriate, given that the outputs of my last layer are logits over the distribution of characters, so this is a multiclass rather than a binary classification problem. Nonetheless, I verified that both 'accuracy' and 'sparse_categorical_accuracy' return the same values in my specific example, so I believe keras internally converts 'accuracy' to the categorical variant, but something goes wrong when it does this on a freshly loaded model, which forces the recompile.
I also verified that if the saved model was compiled with 'accuracy', loading it and recompiling with 'sparse_categorical_accuracy' also allows training to resume. However, as mentioned before, this discards the state of the optimiser, and I suspect it would be no better than simply making a new model and loading only the weights from the saved one.
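A minimal sketch of the working variant, reusing the names from the question above: compile with the explicit metric before the first fit/save, and the reloaded model continues training without a recompile.

model = make_model(vocab_size, emb_dim, rnn_units, batch_size, False)
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              optimizer='adam', metrics=['sparse_categorical_accuracy'])
model.fit(dataset, epochs=10)
model.save('/content/test_model')

model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset, epochs=10)  # no "Dimensions must be equal" error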

Cannot run Tensorflow code multiple times in Jupyter Notebook

I'm struggling to run Tensorflow (v1.1) code multiple times in a Jupyter Notebook.
For example, I execute this simple code snippet that creates an encoding layer for a seq2seq model:
# Construct encoder layer (LSTM)
encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)
First time is totally fine, my encoder is created.
However, if I rerun it (no matter the changes I've applied), I get this error:
Attempt to have a second RNNCell use the weights of a variable scope that already has weights
It's very annoying as it forces me to restart the kernel every time I want to change a layer.
Can someone explain why this happens and how I can fix it?
Thanks!
You are trying to build the exact same graph twice and therefore TensorFlow complains because the variables already exist in the default graph.
What you could do is to call tf.reset_default_graph() before trying to call the method a second time to ensure you create a new graph when required.
Just in case, I would also suggest using an interactive session as described here in the Start TensorFlow InteractiveSession section:
import tensorflow as tf
sess = tf.InteractiveSession()
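And here is a minimal sketch combining both suggestions for the encoder snippet above (TF 1.x API; the placeholder shape and hidden-unit count are assumptions, not values from the question):

import tensorflow as tf

tf.reset_default_graph()  # drop the previously built graph so variable scopes start fresh
sess = tf.InteractiveSession()

encoder_hidden_units = 64  # assumption
encoder_inputs_embedded = tf.placeholder(tf.float32, [None, None, 32], name='encoder_inputs_embedded')  # assumption

encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded,
    dtype=tf.float32, time_major=False
)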

TensorFlow float16 support is broken

Recently I tried to train a CNN in TF using float16. To my surprise, it is broken in various ways even though TF has claimed to support it for a while. For example, float16 optimization causes a NaN loss already on the second step, regardless of the network.
import tensorflow as tf
import numpy as np

slim = tf.contrib.slim

dtype = tf.float16
shape = (4, 16, 16, 3)

inpt = tf.placeholder(dtype, shape, name='input')
net = slim.conv2d(inpt, 16, [3, 3], scope='conv',
                  weights_initializer=tf.zeros_initializer(),
                  # normalizer_fn=slim.batch_norm
                  )
loss = tf.reduce_mean(net)
opt = tf.train.AdamOptimizer(1e-3)
train_op = slim.learning.create_train_op(loss, opt)

val = np.zeros(shape)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(2):
        print(sess.run(train_op, feed_dict={inpt: val}))
To my understanding it is clearly a bug: I apply zero convolutions on zero input, I should get zero gradients that don't change zero loss. It just can't diverge. If dtype is float32 it works. NaN loss occurs both on CPU and GPU versions.
However, I was dismissed in the GitHub issues; a random dude closed the issue saying that it is intended behaviour: https://github.com/tensorflow/tensorflow/issues/7226
If you uncomment the line with BN, it will break already at graph construction time, because BN assumes the moving averages (and beta, gamma) are always float32 and does not cast them properly. This issue was also closed and apparently ignored: https://github.com/tensorflow/tensorflow/issues/7164
I feel like I am talking to the first-line IT support of an ISP.
Can anybody explain how I should train with float16 when such a simple "network" fails horribly? And what is the recommended way to report bugs now?
It looks like you need a slightly larger epsilon to avoid numerical instability with zero moments in AdamOptimizer (default is 1e-8). This works for me with float16:
opt = tf.train.AdamOptimizer(1e-3, epsilon=1e-4)
It would be reasonable to request that epsilon be set based on dtype (and presumably such a request, or better yet a pull request, would be met with a more positive response on GitHub). Note that GradientDescentOptimizer has no such issue.
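To see why the default epsilon misbehaves, here is a small numpy illustration of the likely mechanism (an assumption about the epsilon underflowing once cast to float16, not a trace of TF internals):

import numpy as np

eps_default = np.float16(1e-8)   # underflows to 0.0 (smallest float16 subnormal is ~6e-8)
eps_larger = np.float16(1e-4)    # still representable in float16
m_hat = np.float16(0.0)          # zero gradient -> zero first/second Adam moments
v_hat = np.float16(0.0)

print(m_hat / (np.sqrt(v_hat) + eps_default))  # 0/0 -> nan, which then poisons the weights
print(m_hat / (np.sqrt(v_hat) + eps_larger))   # 0.0, so the update stays finite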