Identify memory leak in tensorflow data pipeline and training? - tensorflow

Memory leaks with the tf.data.Dataset pipeline. Is there a profiler to identify memory leaks in the pipeline or in tf.keras training?
A few questions, if you have any thoughts:
1. Is there an obvious problem in the pseudocode that I am overlooking?
2. Any thoughts on where/what to look for?
3. Any pointers on how to profile RAM usage as training goes on, to pinpoint the problem?
I just moved my codebase to eager mode under TensorFlow 1.15 and I am running into memory issues that I didn't have before. Before moving to eager mode, I could train for 500+ epochs without any issues; now, training stops after 70 epochs. I am trying to figure out a way to identify where the leak is, and I was hoping some of you have ideas.
I am using tf.data.Dataset to build the data pipeline (see pseudocode below), and to speed up data feeding I am using datasets with interleave as shown below. I have preprocessed data stored in sharded TFRecord files, and the dataset API loads the data and does minimal processing to supply appropriately batch-sized data. GPU memory seems fine, and training goes on until CPU RAM is completely depleted. As you can see in the table below, the psutil memory log shows a continuous increase in CPU RAM.
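For reference, a minimal sketch of the kind of per-epoch RSS logging behind that table, using psutil in a tf.keras callback (the callback class itself is mine, not part of the original code):
import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    # Log the resident set size (CPU RAM) of this process after every epoch.
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2
        print('epoch %d: RSS = %.1f MB' % (epoch, rss_mb))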
What I have tried:
Explicitly calling gc.collect() and tf.set_random_seed(1), as suggested by the following, but neither seems to help:
https://github.com/tensorflow/tensorflow/issues/30324
Memory Continually Increasing When Iterating in Tensorflow Eager Execution
Ubuntu 18.04, tf-nightly-gpu 1.15.0.dev20190728
CPU - AMD Ryzen Threadripper 1920X 12-Core Processor
RAM – 128GB
GPU - RTX 2080 Ti 11GB
# Generator that is passed to fit_generator
def get_simple_Dataset_generator(...):
    dataset = load_dataset(...)
    while True:
        try:
            for x, Y in dataset:
                yield x, Y
        finally:
            dataset = load_dataset("change data sources")
            # tried gc.collect(), tf.set_random_seed(1)
# Sets up the dataset with interleave.
def load_dataset(...):
    # setup etc.
    dataset = dataset.interleave(
        lambda x: tf.data.Dataset.from_generator(
            self.simple_gen_step1,
            output_types=(tf.string, tf.float32, tf.string),
            args=(x, batch_size, lstm_reshape,)),
        cycle_length=2,
        block_length=1)
    dataset = dataset.interleave(
        lambda each_ticker, each_dataset, each_dates: tf.data.Dataset.from_generator(
            self.simple_gen_step2,
            output_types=(tf.float32, tf.int16),
            args=(names, dataset, dates, batch_size,)),
        cycle_length=2,
        block_length=1)
    return dataset
# Our model uses CuDNNLSTM and Dense layers
def build_model():
    model = Sequential()
    model.add(CuDNNLSTM(feature_count,
                        batch_input_shape=(batch_size, look_back, feature_count),
                        stateful=Settings.get_config(Settings.STATEFUL),
                        return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 1), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 2), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 3), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(target_classes, activation='softmax'))
    return model
CPU RAM shown in the psutil log (table image omitted)

For anyone running into a similar issue: I think there is a memory leak if class_weights are used in fit_generator. I have posted another question with more details:
Using class_weights in fit_generator causes memory leak
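As a hedged workaround sketch (not taken from the linked post), you can avoid passing class_weight to fit_generator entirely by folding the weights into per-sample weights yielded by the generator; the weight values below are made up for illustration:
import numpy as np

# Assumed example class weights; substitute your own.
class_weight = {0: 1.0, 1: 5.0}

def weighted_generator(base_generator):
    # Wrap an (x, y) generator so it yields (x, y, sample_weight) tuples,
    # which removes the need for the class_weight argument.
    for x, y in base_generator:
        # y is assumed to be one-hot encoded here.
        sample_weight = np.array([class_weight[c] for c in y.argmax(axis=-1)])
        yield x, y, sample_weight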

I thought I'd share what I have found regarding memory leakage in TensorFlow 2.x. It might not be a 100% specific answer to your concrete questions, but it might help others solve their memory leakage issues when using built-in functions like model.fit().
Here is a link to one of the related GitHub issues, and here is my solution (please also consider the comments on my solution).
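As a general sketch of the kind of mitigation discussed in those threads (not necessarily the exact solution linked above), clearing the Keras session and forcing garbage collection between repeated model.fit() calls often keeps host RAM flat:
import gc
import tensorflow as tf

def build_model():
    # Hypothetical model factory; substitute your own architecture.
    return tf.keras.Sequential([tf.keras.layers.Dense(10)])

for trial in range(100):
    model = build_model()
    model.compile(optimizer='adam', loss='mse')
    # model.fit(...) goes here
    del model
    tf.keras.backend.clear_session()  # drop graph state Keras accumulates across fits
    gc.collect()                      # reclaim Python-side objects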

Related

Memory leak with TPUs on GKE causing OOM/"Unavailable: Socket Closed" error

I am using preemptible v2-8 Google Cloud TPUs to perform large-scale hyperparameter optimization. I created the nodes using GKE with TensorFlow 2.3 (the latest version available for Cloud TPUs). Unfortunately, I keep encountering a memory leak on the TPU nodes during the search. This memory leak seems to ultimately cause an "Unavailable: Socket Closed" error (or sometimes an OOM error), where the TPU becomes unable to perform any additional training or evaluation even after re-deploying the code. The problem does not occur when I test my code on either a CPU or GPU.
This problem only occurs on the TPU worker node, not on the controller CPU. (At one point, I had been encountering another memory leak on the CPU due to a buildup of old models and unnecessary operations on the computation graph.) Methods such as tf.keras.backend.clear_session() and del model resolved the memory leak on the CPU, but it persists on the TPU. Here is a graph of the TPU runtime memory usage (graph omitted here; the decrease in memory at the end appears to occur after the TPU disconnects, because GKE deletes it automatically).
Ultimately, as the used memory increases on the TPU, I get the following error:
raise_from tensorflow.python.framework.errors_impl.ResourceExhaustedError: 9 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
[[{{node cluster_train_function/_execute_4_0}}]]
0 successful operations.
0 derived errors ignored.
Occasionally, I instead get an "Unavailable: Socket Closed" error or an "Unable to destroy remote tensor handles" error.
This error typically only occurs after training several networks. I tried multiple methods suggested by other posts to fix the error, such as casting my data to float32, not caching my dataset into memory, using a smaller mini-batch size to decrease memory consumption, and using from_logits=True in my cost function. I even tried using multiprocessing to perform the network training so memory would be cleared after each network evaluation, but for some reason the Cloud TPU fails to execute any of the for loops in my code or in the training code (a problem I did not have with either a GPU or CPU, cloud or otherwise). Larger networks seem to cause the problem to occur much more quickly than smaller networks, which suggests to me that old, unused models are still kept in memory on the TPU. Is there any way to clear the memory on the TPU or reset its state to stop this memory leak?
Here is an MVE I wrote to duplicate the problem:
import os
import gc
import sys
import random
import numpy as np
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

h = 128
w = 128
channels = 1
mini_batch_size = 256
epochs = 15
using_tpu = True

if using_tpu:
    ## Get TPU name from arguments
    tpu_name = sys.argv[1]
    tpu_name = tpu_name.replace('--tpu=', '')

    ## Initialize TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)

def create_network():
    strategy = tf.distribute.TPUStrategy(tpu)
    with strategy.scope():
        ## Create random data
        x_train = np.random.randn(1024, 128, 128, 1).astype('float32')  # astype necessary to help prevent Connect to Socket Error
        y_train = np.random.randn(1024, 50).astype('float32')
        x_test = np.random.randn(256, 128, 128, 1).astype('float32')
        y_test = np.random.randn(256, 50).astype('float32')

        model = Sequential()
        model.add(InputLayer((h, w, channels)))
        layers = 5
        ks = [np.random.choice([3, 5, 7]) for l in range(layers)]
        filters = [np.random.choice([64, 128, 256]) for l in range(layers)]
        for l in range(layers):
            model.add(
                Conv2D(kernel_size=(ks[l], ks[l]), padding='same',
                       filters=filters[l], name='conv' + str(l), activation='relu'))
        model.add(Flatten())
        # Output layer; no softmax activation because from_logits performs that operation automatically
        model.add(Dense(50))

        lr = 0.001
        opt = Adam(learning_rate=lr, decay=1e-6)
        model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, shuffle=True, verbose=1)

        ##### memory leak also occurs with the dataset API:
        '''
        train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(mini_batch_size,
                                                                                     drop_remainder=True)
        model.fit(train_dataset, epochs=epochs, verbose=1, shuffle=shuffle,
                  steps_per_epoch=len(x_train) // mini_batch_size)
        '''
        #######

        y_pred = model(x_test)

    ## Attempt to clear memory
    print(gc.collect())
    del model
    tf.keras.backend.clear_session()

while True:
    create_network()
Thank you so much! Please let me know if I should include any other information.
A few things:
Your error message:
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
indicates an HBM OOM rather than a host memory OOM. Basically, the TPU has its own memory on the chips; in this case you've exhausted that memory. If it were a host RAM OOM, you would likely see the SocketClosed error instead, which you saw as well.
That being said, what are your options? I suggest you go with the tf.data approach but with a few modifications:
def get_dataset(is_training: bool):
    def generate_data(_):
        return tf.random.normal([128, 128, 1], dtype=tf.bfloat16)
    dataset = tf.data.Dataset.range(1)
    dataset = dataset.repeat()
    dataset = dataset.map(generate_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat().batch(mini_batch_size, drop_remainder=is_training)
    return dataset

train_dataset = get_dataset(is_training=True)
eval_dataset = get_dataset(is_training=False)
In this example we can use bfloat16 which reduces the memory footprint on HBM, but you may need to further reduce your minibatch size from 1024 to 512. Alternatively you can go up from v2-8 to v3-8 which has 2x the HBM. I'm not sure if the numpy based method contributes to the OOMs/SocketClosed errors you see, but I don't think this approach should run into that. Of course you'll eventually use real data, and in that case you should use tf.data for optimal performance. More info here.
IIUC tf.keras.backend.clear_session() and gc.collect() will only clear memory on your host VM, not on the TPU server.
PS: You can also use the steps_per_execution flag to further improve the performance. Please see here for more info. Basically this prevents execution from continuously switching from CPU to TPU every step. If you set this to equal the number of training steps in an epoch, this will give you the best throughput.
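As an illustration of how that flag is wired in, here is a hedged sketch against the MVE above (note that in TF 2.3, which the question uses, the compile() argument is still spelled experimental_steps_per_execution; from TF 2.4 onward it is steps_per_execution):
# Steps per epoch for the MVE above: 1024 training examples / mini_batch_size.
steps_per_epoch = 1024 // mini_batch_size

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
    experimental_steps_per_execution=steps_per_epoch)  # TF 2.3 spelling; steps_per_execution in TF 2.4+
model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, verbose=1)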

Keras OOM for data validation using GPU

I'm trying to run a deep model on the GPU, and it seems Keras is running the validation against the whole validation data set in one batch instead of validating in many batches, which is causing an out-of-memory problem:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[160000,64,64,1] and type double on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:GatherV2]
I did not have this problem when running on the CPU; it only happens when running on the GPU. My fit code looks like this:
history = model.fit(patches_imgs_train, patches_masks_train, batch_size=8,
                    epochs=10, shuffle=True, verbose=1, validation_split=0.2)
When I delete the validation parameter from the fit method the code works, but I need the validation.
Since no one is answering this, I can offer you a workaround. You can separate fit() and evaluate() and run the evaluation on CPU.
You'll have to split your data manually to provide the testx and testy to evaluate().
for i in range(10):
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
You'll need to deal with the accuracy values yourself if you want some form of early stopping.
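For example, a minimal sketch of manual early stopping inside that loop (the patience value is arbitrary):
best_acc, patience, wait = 0.0, 3, 0
for i in range(10):
    with tf.device('/GPU:0'):
        model.fit(x, y, epochs=1)
    with tf.device('/CPU:0'):
        loss, acc = model.evaluate(testx, testy)
    if acc > best_acc:
        best_acc, wait = acc, 0   # improvement: reset the patience counter
    else:
        wait += 1
        if wait >= patience:      # no improvement for `patience` rounds: stop
            break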
It isn't perfect but it'll allow you to run much larger networks without OOMs.
Hope it helps.
So I would consider what is happening a bug in the Keras implementation: it looks like it tries to load the whole data set into memory to split it into validation and training sets, and it's not related to batch size. After trying many ways to work around it, I found the best approach is to split the data using sklearn's train_test_split instead of the validation_split parameter in the fit method.
x_train, x_v, y_train, y_v = train_test_split(x, y, test_size=0.2, train_size=0.8)
history = model.fit(x_train, y_train,
                    batch_size=16,
                    epochs=5,
                    shuffle=True,
                    verbose=2,
                    validation_data=(x_v, y_v))

Tensorflow ResNet model loading uses ~5 GB of RAM - while loading from weights uses only ~200 MB

I trained a ResNet50 model using Tensorflow 2.0 by transfer learning. I slightly modified the architecture (new classification layer) and saved the model with the ModelCheckpoint callback https://keras.io/callbacks/#modelcheckpoint during training. Training was fine. The model saved by callback takes ~206 MB on the hard drive.
To predict using the model I did:
I started a Jupyter Lab notebook. I used my_model = tf.keras.models.load_model('../models_using/my_model.hdf5') for loading the model. (btw, the same occurs using IPython).
I used the free linux command line tool to measure the free RAM just before the loading and after. The model loading takes about 5 GB of RAM.
I saved the weights of the model and the config as json. This takes about 105 MB.
I loaded the model from the json config and weights. This takes about ~200 MB of RAM.
Compared the predictions of both models. Exactly the same.
I tested the same procedure with a slightly different architecture (trained the same way) and the results were the same.
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
Btw, given a model in Keras, can you find out the compilation procedure (optimizer, ...)? model.summary() does not help.
2019-12-07 - EDIT: Thanks to this answer, I conducted a series of tests:
I used the !free command in JupyterLab to measure the available memory before and after each test. Since get_weights returns a list, I used copy.deepcopy to really copy the objects. Note, the commands below were separate Jupyter cells, and the memory comments were added just for this answer.
!free
model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786.736 MB used
!free
weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51.616 MB used
!free
optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100.868 MB used
!free
model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199.228 MB used
!free
Loading the model from json:
!free
# loading from json
with open('model_json.json') as f:
    model_json_weights = tf.keras.models.model_from_json(f.read())
model_json_weights.load_weights('model_weights.h5')
!free
# 21132664 - 20971616 = 161.048 MB used
The difference between checkpoint and JSON+Weights is in the optimizer:
The checkpoint or model.save() saves the optimizer and its weights (load_model compiles the model)
JSON + weights doesn't save the optimizer
Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a tensor of "momentum" for each weight tensor, for instance).
Some optimizers might take two times the size of the model, because they keep two tensors of optimizer weights for each tensor of model weights.
Saving and loading the optimizer is important if you want to continue training. Starting training again with a new optimizer without proper weights will sort of destroy the model's performance (at least in the beginning).
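If you only need the model for inference, a hedged sketch of keeping the checkpoint small by dropping the optimizer state (include_optimizer is a standard argument of Keras model.save()):
# Save only architecture + weights; no optimizer slot variables are stored.
model.save('model_no_optimizer.hdf5', include_optimizer=False)

# Loading still works, but you must compile again before resuming training.
model2 = tf.keras.models.load_model('model_no_optimizer.hdf5', compile=False)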
Now, the 5GB is not really clear to me. But I suppose that:
There should be a lot of compression in saved weights
It might have to do with also allocating memory for all the gradient and backpropagation operations
Interesting tests:
Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These weights will be numpy, copied from the original tensors
Gradient/Backpropagation: check how much memory is used by:
load_model(name, compile=True)
load_model(name, compile=False)
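A minimal sketch of the first test, summing the numpy byte sizes of those weight lists (assuming the model variable from the cells above):
def total_mb(arrays):
    # Sum the in-memory size of a list of numpy arrays, in MB.
    return sum(a.nbytes for a in arrays) / 1024 ** 2

print('model weights:     %.1f MB' % total_mb(model.get_weights()))
print('optimizer weights: %.1f MB' % total_mb(model.optimizer.get_weights()))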
I had a very similar issue with a recent model that I was attempting to load. Granted, this is a year-old question, so I am uncertain whether my solution will work for you. However, reading through the Keras documentation on saved models, I found this piece of code to be very useful:
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
It turns out that in my case the loaded model was using up all of the GPU memory and causing issues, so this setting forces it to grow GPU memory allocation only as needed, or at least that is my conclusion.

How To Run Two Models In Parallel On Two Different GPUs In Keras

I want to do a grid search for parameters on neural nets. I have two GPUs, and I would like to run one model on the first GPU, and another model with different parameters on the second GPU. A first attempt that doesn't work goes like this:
with tf.device('/gpu:0'):
    model_1 = Sequential()
    model_1.add(embedding)  # the embeddings are defined earlier in the code
    model_1.add(LSTM(50))
    model_1.add(Dense(5, activation='softmax'))
    model_1.compile(loss='categorical_crossentropy', optimizer='adam')
    model_1.fit(np.array(train_x), np.array(train_y), epochs=15, batch_size=15)

with tf.device('/gpu:1'):
    model_2 = Sequential()
    model_2.add(embedding)
    model_2.add(LSTM(100))
    model_2.add(Dense(5, activation='softmax'))
    model_2.compile(loss='categorical_crossentropy', optimizer='adam')
    model_2.fit(np.array(train_x), np.array(train_y), epochs=15, batch_size=15)
Edit: I ran my code again and did not get an error. However, the two models run sequentially rather than in parallel. Is it possible to do multithreading here? That is my next attempt.
There is a lot of discussion online about using multiple GPUs with keras, but when it comes to running multiple models simultaneously, the discussion is limited to running multiple models on a single GPU. The discussion regarding multiple GPUs is also limited to data parallelization and device parallelization. I don't believe I want to do either since I am not trying to break up a single model to run on multiple gpus. Is it possible to run two separate models simultaneously in keras with two GPUs?
A solution to this problem can be found here. However, the softmax activation function runs only on the CPU as of now, so it is necessary to direct the CPU to perform the Dense layer:
with tf.device('/cpu:0'):
Switching between the CPU and the GPU does not seem to cause a noticeable slowdown. With LSTMs, though, it may be best to run the entire model on the CPU.
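As a hedged sketch of what that placement could look like (layer sizes mirror the question's model; this is illustrative, not the linked solution verbatim):
with tf.device('/gpu:0'):
    model_1 = Sequential()
    model_1.add(embedding)
    model_1.add(LSTM(50))

# Place the Dense/softmax layer on the CPU, as described above.
with tf.device('/cpu:0'):
    model_1.add(Dense(5, activation='softmax'))

model_1.compile(loss='categorical_crossentropy', optimizer='adam')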
You can use multi_gpu_model (reference here)
Define your model first
model = Sequential()
model.add(embedding)  # the embeddings are defined earlier in the code
model.add(LSTM(50))
model.add(Dense(5, activation='softmax'))
and then create a multi_gpu_model with 2 GPUs:
parallel_model = multi_gpu_model(model, gpus=2)
This will work if you want to divide the input and process it on 2 GPUs. It will not cover your use case of having two different models on two GPUs though.
Because your code runs sequentially, you can try threading to run the two blocks in parallel. Googling "python multithreading" will give you lots of examples; a sketch follows below.
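A minimal sketch of that threading approach, assuming the two model-building blocks from the question are wrapped into functions (train_model_1 and train_model_2 are hypothetical names for those wrappers):
import threading

def train_on_gpu(device, build_and_train):
    # Pin everything this thread does to one GPU.
    with tf.device(device):
        build_and_train()

t1 = threading.Thread(target=train_on_gpu, args=('/gpu:0', train_model_1))
t2 = threading.Thread(target=train_on_gpu, args=('/gpu:1', train_model_2))
t1.start()
t2.start()
t1.join()
t2.join()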

Using batch size with TensorFlow Validation Monitor

I'm using tf.contrib.learn.Estimator to train a CNN with 20+ layers. I'm using a GTX 1080 (8 GB) for training. My dataset is not that large, but my GPU runs out of memory with a batch size greater than 32. So I'm using a batch size of 16 for training and for evaluating the classifier (the GPU runs out of memory during evaluation as well if a batch_size is not specified).
# Configure the accuracy metric for evaluation
metrics = {
    "accuracy":
        learn.MetricSpec(
            metric_fn=tf.metrics.accuracy, prediction_key="classes"),
}

# Evaluate the model and print results
eval_results = classifier.evaluate(
    x=X_test, y=y_test, metrics=metrics, batch_size=16)
Now the problem is that after every 100 steps, I only get the training loss printed on screen. I want to print validation loss and accuracy as well, so I'm using a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    every_n_steps=50)

# Train the model
classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=8,
    steps=20000,
    monitors=[validation_monitor])
Actual problem: My code crashes (out of memory) when I use the ValidationMonitor. I think the problem could be solved if I could specify a batch size here as well, but I can't figure out how to do that. I want the ValidationMonitor to evaluate my validation data in batches, like I do manually after training using classifier.evaluate; is there a way to do that?
The ValidationMonitor's constructor accepts a batch_size arg that should do the trick.
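A minimal sketch of passing it (batch_size is part of the ValidationMonitor constructor; the value mirrors the evaluate call above):
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    batch_size=16,      # evaluate the validation set in batches of 16
    every_n_steps=50)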
You also need to add config=tf.contrib.learn.RunConfig(save_checkpoints_secs=save_checkpoints_secs) to your Estimator definition. save_checkpoints_secs can be changed to save_checkpoints_steps, but you cannot set both.
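For illustration, a sketch of where that config goes (my_model_fn and the interval value are placeholders):
classifier = tf.contrib.learn.Estimator(
    model_fn=my_model_fn,  # hypothetical model function
    config=tf.contrib.learn.RunConfig(save_checkpoints_secs=120))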