Since I have a large dataset and not much power in my PC, I thought it was a good idea to use TPU on Google Colab.
So, here is my TPU configuration:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
And here is my training:
hist = model.fit(train_dataset, epochs=10, verbose=1, steps_per_epoch=count_data_items(filenames)//64)
It is not enough to create a strategy; you also have to use it correctly. You will probably have to tune your input pipeline, increase the batch size, and so on.
Have a look here: https://cloud.google.com/tpu/docs/performance-guide
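For example, the model should be built and compiled inside strategy.scope(), and the batch size is usually scaled by the number of replicas. A minimal sketch (build_model and parse_example are placeholders, not code from the question):

per_replica_batch_size = 64
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

with strategy.scope():
    model = build_model()  # placeholder: create and compile the model under the strategy
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

train_dataset = (
    tf.data.TFRecordDataset(filenames)
    .map(parse_example, num_parallel_calls=tf.data.experimental.AUTOTUNE)  # placeholder decode fn
    .batch(global_batch_size, drop_remainder=True)  # TPUs prefer static batch shapes
    .prefetch(tf.data.experimental.AUTOTUNE)
)

hist = model.fit(train_dataset, epochs=10,
                 steps_per_epoch=count_data_items(filenames) // global_batch_size)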
Another important point is that TPU has a warm-up period — it spends a lot of time building a computation graph during the first calls (every call with a new input shape).
The number of TPU cores available to Colab notebooks is currently 8. Takeaways: from the observed training times, the TPU takes considerably longer to train than the GPU when the batch size is small, but as the batch size increases the TPU's performance becomes comparable to that of the GPU. Go through this link for more details.
I am using TensorFlow 2.5.0 and implemented a semantic segmentation network: DeepLab_v3_plus with a ResNet101 backbone, the Adam optimizer, and categorical cross-entropy loss. I first built the code for a single GPU and reached a test accuracy (mean_iou) of 54% after training for 96 epochs. I then added tf MirroredStrategy (one machine) to the code to support multi-GPU training. Surprisingly, with 2 GPUs and 48 epochs of training, the test mean_iou is just 27%, and with 4 GPUs and 24 epochs it comes out around 12% on the same dataset.
Here is how I modified the code to go from single-GPU to multi-GPU training.
Following the TensorFlow guide for distributed training, I created a MirroredStrategy and placed the model creation, model compilation, and dataset generator inside the strategy scope. As I understand it, model.fit() then takes care of synchronizing gradients and distributing data across the GPUs. Although the code runs without errors and training time does drop compared to a single GPU for the same number of images, the test mean_iou keeps getting worse as more GPUs are added.
Replaced BatchNormalization with SyncBatchNormalization, but no improvement.
Used learning-rate warmup with linear scaling of the learning rate by the number of GPUs, but no improvement.
In the cross-entropy loss, used both losses_utils.ReductionV2.AUTO and losses_utils.ReductionV2.NONE:
loss = ce(y_true, y_pred)
# reshape loss for each sample (BxHxWxC -> BxN)
# Normalize loss by number of non zero elements and sum for each sample and mean across all samples.
With the .AUTO/.NONE options I am not scaling the loss by global_batch_size, on the understanding that TF takes care of it and I am already normalizing per GPU, but neither option helped (a sketch of the explicit scaling alternative is shown after this list).
Changed the data_generator to a tf.data.Dataset object. It helped with training time, but the test mean_iou became even worse.
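For completeness, the explicit alternative with Reduction.NONE and manual scaling by the global batch size would look roughly like this (simplified sketch with illustrative names, not my exact code):

ce = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

def compute_loss(y_true, y_pred, global_batch_size):
    per_pixel_loss = ce(y_true, y_pred)                            # shape (B, H, W)
    per_sample_loss = tf.reduce_mean(per_pixel_loss, axis=[1, 2])  # shape (B,)
    # compute_average_loss divides by the global batch size so that summing
    # gradients across replicas gives the correct overall gradient
    return tf.nn.compute_average_loss(per_sample_loss,
                                      global_batch_size=global_batch_size)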
I would appreciate any lead or suggestion for improving the test mean_iou in distributed training.
Let me know if you need any additional details.
Thank you
I am using preemptible v2-8 Google Cloud TPUs to perform large-scale hyperparameter optimization. I created the nodes using GKE with TensorFlow 2.3 (the latest version available for Cloud TPUs). Unfortunately, I keep encountering a memory leak on the TPU nodes during the search. This memory leak seems to ultimately cause an "Unavailable: Socket Closed" error (or sometimes an OOM error), where the TPU becomes unable to perform any additional training or evaluation even after re-deploying the code. The problem does not occur when I test my code on either a CPU or GPU.
This problem only occurs on the TPU worker node, but not the controller CPU. (At one point, I had been encountering another memory leak on the CPU due to a buildup of old models and unnecessary operations on the computation graph.) Methods such as tf.backend.clear_session() and del model resolved the memory leak with the CPU, but it persists on the TPU. Here is a graph of the TPU runtime memory usage (the decrease in memory at the end appears to occur after the TPU disconnects because GKE deletes it automatically):
Ultimately, as the used memory increases on the TPU, I get the following error:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 9 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
[[{{node cluster_train_function/_execute_4_0}}]]
0 successful operations.
0 derived errors ignored.
Occasionally, I instead get an "Unavailable: Socket Closed" error or an "Unable to destroy remote tensor handles" error.
This error typically only occurs after training several networks. I tried multiple methods suggested by other posts to fix the error, such as typecasting my data to float32, not caching my dataset into memory, using a smaller mini batch size to decrease memory consumption, and using "from_logits=True" in my cost function. I even tried using multiprocessing to perform the network training so memory would be cleared after each network evaluation, but for some reason, the Cloud TPU fails to execute any of the for loops in my code or in the training code (a problem I did not have with either a GPU or CPU, cloud or otherwise.) Larger networks seem to cause the problem to occur much more quickly than smaller networks, which suggests to me that old, unused models are still kept in memory on the TPU. Is there any way to clear the memory on the TPU or reset its state to stop this memory leak?
Here is an MVE I wrote to duplicate the problem:
import os
import gc
import sys
import random
import numpy as np
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

h = 128
w = 128
channels = 1
mini_batch_size = 256
epochs = 15
using_tpu = True

if using_tpu:
    ## Get tpu name from arguments
    tpu_name = sys.argv[1]
    tpu_name = tpu_name.replace('--tpu=', '')

    ## Initialize TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)


def create_network():
    strategy = tf.distribute.TPUStrategy(tpu)
    with strategy.scope():
        ## Create random data
        x_train = np.random.randn(1024, 128, 128, 1).astype('float32')  # astype necessary to help prevent Connect to Socket Error
        y_train = np.random.randn(1024, 50).astype('float32')
        x_test = np.random.randn(256, 128, 128, 1).astype('float32')
        y_test = np.random.randn(256, 50).astype('float32')

        model = Sequential()
        model.add(InputLayer((h, w, channels)))
        layers = 5
        ks = [np.random.choice([3, 5, 7]) for l in range(layers)]
        filters = [np.random.choice([64, 128, 256]) for l in range(layers)]
        for l in range(layers):
            model.add(
                Conv2D(kernel_size=(ks[l], ks[l]), padding='same',
                       filters=filters[l], name='conv' + str(l), activation='relu'))
        model.add(Flatten())

        # Softmax output layer
        model.add(Dense(50))  # Don't need softmax activation because from_logits performs that operation automatically

        lr = 0.001
        opt = Adam(learning_rate=lr, decay=1e-6)
        model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, shuffle=True, verbose=1)

        ##### memory leak also occurs with dataset API:
        '''
        train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(mini_batch_size,
                                                                                     drop_remainder=True)
        model.fit(train_dataset, epochs=epochs, verbose=1, shuffle=shuffle,
                  steps_per_epoch=len(x_train) // mini_batch_size)
        '''
        #######

        y_pred = model(x_test)

        ## Attempt to clear memory
        print(gc.collect())
        del model
        tf.keras.backend.clear_session()


while True:
    create_network()
Thank you so much! Please let me know if I should include any other information.
A few things:
Your error message:
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
indicates an HBM OOM rather than a host-memory OOM. The TPU has its own memory (HBM) on the chips, and in this case you've exhausted that memory. If it were a host RAM OOM, you would likely see the SocketClosed error instead, which you saw as well.
That being said, what are your options? I suggest you go with the tf.data approach but with a few modifications:
def get_dataset(is_training: bool):
    def generate_data(_):
        return tf.random.normal([128, 128, 1], dtype=tf.bfloat16)

    dataset = tf.data.Dataset.range(1)
    dataset = dataset.repeat()
    dataset = dataset.map(generate_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat().batch(mini_batch_size, drop_remainder=is_training)
    return dataset

train_dataset = get_dataset(is_training=True)
eval_dataset = get_dataset(is_training=False)
In this example we can use bfloat16 which reduces the memory footprint on HBM, but you may need to further reduce your minibatch size from 1024 to 512. Alternatively you can go up from v2-8 to v3-8 which has 2x the HBM. I'm not sure if the numpy based method contributes to the OOMs/SocketClosed errors you see, but I don't think this approach should run into that. Of course you'll eventually use real data, and in that case you should use tf.data for optimal performance. More info here.
IIUC tf.backend.clear_session() and gc.collect() will only clear memory on your host VM, not on the TPU server.
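One thing that sometimes helps (this is an assumption on my part, not something verified on your exact setup) is to re-run the TPU system initialization between trials so that state held on the TPU workers is dropped, roughly:

# Hedged sketch: re-initializing the TPU system between trials to drop TPU-side state.
# `tpu` is the TPUClusterResolver you already created.
tf.keras.backend.clear_session()
tf.tpu.experimental.initialize_tpu_system(tpu)  # resets the TPU runtime
strategy = tf.distribute.TPUStrategy(tpu)       # recreate the strategy afterwards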
PS: You can also use the steps_per_execution flag to further improve the performance. Please see here for more info. Basically this prevents execution from continuously switching from CPU to TPU every step. If you set this to equal the number of training steps in an epoch, this will give you the best throughput.
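For example (a sketch; in TF 2.3 the compile argument is named experimental_steps_per_execution, and it was renamed to steps_per_execution in later releases):

steps_per_epoch = 1024 // mini_batch_size  # training steps in one epoch for the MVE above

with tpu_strategy.scope():
    model = build_model()  # placeholder for however you construct the Keras model
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
        experimental_steps_per_execution=steps_per_epoch)  # run a full epoch per host->TPU call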
The official TPU documentation says that training files must be on Google Cloud Storage:
https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
But I have a smaller dataset (although training takes a very long time because it is based on sampling/permutations) that can be loaded entirely into memory (1-2 GB). I am wondering if I can somehow just transfer the data objects to the TPU directly and have it train on them.
If it makes a difference, I am using Keras to do my TPU training.
What I looked at so far:
It seems that you can load certain data onto individual TPU cores:
self.workers = ['/job:worker/replica:0/task:0/device:TPU:' + str(i) for i in range(num_tpu_cores)]
with tf.device(self.workers[0]):
    vecs = vectors[i]
However, I am not sure if this would translate into coordinated training among all the TPU cores.
You can read files with Python:
with open(image_path, "rb") as local_file:
    img = local_file.read()
1-2 GB may be too big for the TPU. If you run out of memory, split your data into smaller portions.
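If the 1-2 GB do fit in host memory, a hedged sketch of feeding in-memory arrays to the TPU through tf.data (the file names and build_model are placeholders):

import numpy as np
import tensorflow as tf

# Assumes `strategy` is the TPUStrategy already set up for the notebook
x = np.load('features.npy')  # placeholder: any ordinary Python/numpy loading works here
y = np.load('labels.npy')

dataset = (
    tf.data.Dataset.from_tensor_slices((x, y))
    .shuffle(len(x))
    .batch(128, drop_remainder=True)  # TPUs prefer static batch shapes
    .prefetch(tf.data.experimental.AUTOTUNE)
)

with strategy.scope():
    model = build_model()  # placeholder for your Keras model
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

model.fit(dataset, epochs=10)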
I trained a ResNet50 model with transfer learning using TensorFlow 2.0. I slightly modified the architecture (new classification layer) and saved the model during training with the ModelCheckpoint callback https://keras.io/callbacks/#modelcheckpoint. Training was fine. The model saved by the callback takes ~206 MB on the hard drive.
To predict using the model I did:
I started a Jupyter Lab notebook. I used my_model = tf.keras.models.load_model('../models_using/my_model.hdf5') for loading the model. (btw, the same occurs using IPython).
I used the free linux command line tool to measure the free RAM just before the loading and after. The model loading takes about 5 GB of RAM.
I saved the weights of the model and the config as json. This takes about 105 MB.
I loaded the model from the json config and weights. This takes about ~200 MB of RAM.
Compared the predictions of both models. Exactly the same.
I tested the same procedure with a slightly different architecture (trained the same way) and the results were the same.
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
Btw, given a model in Keras, can you find out the compilation procedure (optimizer, ...)? Model.summary() does not help.
2019-12-07 - EDIT: Thanks to this answer, I conducted a series of tests:
I used the !free command in JupyterLab to measure the available memory before and after each test. Since get_weights returns a list, I used copy.deepcopy to really copy the objects. Note that the commands below were run in separate Jupyter cells and the memory comments were added just for this answer.
!free
model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786.736 MB used
!free
weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51.616 MB used
!free
optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100.868 MB used
!free
model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199.228 MB used
!free
Loading the model from json:
!free
# loading from json
with open('model_json.json') as f:
    model_json_weights = tf.keras.models.model_from_json(f.read())
model_json_weights.load_weights('model_weights.h5')
!free
# 21132664 - 20971616 = 161.048 MB used
The difference between checkpoint and JSON+Weights is in the optimizer:
The checkpoint or model.save() saves the optimizer and its weights (and load_model compiles the model)
JSON + weights doesn't save the optimizer
Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a "momentum" tensor for each weight tensor, for instance).
Some optimizers might take two times the size of the model, because they keep two tensors of optimizer weights for each tensor of model weights.
Saving and loading the optimizer is important if you want to continue training. Starting training again with a new optimizer without proper weights will sort of destroy the model's performance (at least in the beginning).
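As a quick illustration with standard Keras calls (a sketch, not the question's code):

# Full save keeps architecture + weights + optimizer slot variables
model.save('checkpoint.h5')

# Reloading with compile=True restores the optimizer state, so training resumes
# where it left off instead of restarting the optimizer from scratch
model = tf.keras.models.load_model('checkpoint.h5', compile=True)
model.fit(x_train, y_train, epochs=5)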
Now, the 5GB is not really clear to me. But I suppose that:
There should be a lot of compression in saved weights
It might have to do with also allocating memory for all the gradient and backpropagation operations
Interesting tests:
Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These weights will be numpy arrays, copied from the original tensors (see the helper sketch below).
Gradient/Backpropagation: check how much memory is used by:
load_model(name, compile=True)
load_model(name, compile=False)
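A small helper for the compression check (my own sketch; get_weights() returns plain numpy arrays, so nbytes gives their uncompressed in-memory size):

def total_mb(weight_list):
    # Sum the raw numpy sizes of a list of weight arrays, in MB
    return sum(w.nbytes for w in weight_list) / 1e6

print('model weights:     %.1f MB' % total_mb(model.get_weights()))
print('optimizer weights: %.1f MB' % total_mb(model.optimizer.get_weights()))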
I had a very similar issue with a recent model that I was attempting to load. Admittedly this is a year-old question, so I am not certain my solution will work for you, but I found it while reading through the Keras documentation on saved models.
I found this piece of code to be very useful:
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
It turns out that in my case the loaded model was using up all of the GPU memory and causing issues, so this forces TensorFlow to grow its GPU memory allocation as needed rather than grabbing it all up front, or at least that is my conclusion.
Memory leaks with a tf.data.Dataset pipeline. Is there a profiler to identify memory leaks in the pipeline or in tf.keras training?
A few questions, if you have any thoughts:
1. Is there an obvious problem in the pseudo code that I am overlooking?
2. Any thoughts on where/what to look for?
3. Any pointers on how to profile RAM usage as training goes on, to pinpoint the problem?
I just moved my codebase to eager mode under TensorFlow 1.15 and I am running into memory issues that I didn't have before. Before moving to eager mode, I could train for 500+ epochs without any issues; now training stops after 70 epochs. I am trying to figure out a way to identify where the leak is, and I was hoping some of you have ideas.
I am using tf.data.Dataset to build the data pipeline (see pseudo code below) and, to speed up data feeding, I use datasets with interleave as shown below. I have preprocessed data stored in sharded TFRecord files, and the dataset API loads it and does minimal processing to supply batches of the appropriate size. GPU memory seems fine, and training goes on until the CPU RAM is completely depleted. As the table below shows, the psutil memory log shows a continuous increase in CPU RAM.
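The per-epoch memory numbers come from a small psutil callback, roughly like this (simplified sketch, not my exact code):

import os
import psutil
import tensorflow as tf

class MemoryLogger(tf.keras.callbacks.Callback):
    # Logs the resident memory (RSS) of this process after every epoch
    def on_epoch_end(self, epoch, logs=None):
        rss_mb = psutil.Process(os.getpid()).memory_info().rss / 1e6
        print('epoch %d: RSS %.1f MB' % (epoch, rss_mb))

# passed to fit_generator via callbacks=[MemoryLogger()]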
What I have tried:
Explicitly calling gc.collect() and tf.set_random_seed(1) as suggested by the links below, but neither seems to help.
https://github.com/tensorflow/tensorflow/issues/30324
Memory Continually Increasing When Iterating in Tensorflow Eager Execution
Ubuntu 18.04, tf-nightly-gpu 1.15.0.dev20190728
CPU - AMD Ryzen Threadripper 1920X 12-Core Processor
RAM – 128GB
GPU - RTX 2080 Ti 11GB
# Generator that is passed to fit_generator
def get_simple_Dataset_generator(...):
    dataset = load_dataset(...)
    while True:
        try:
            for x, Y in dataset:
                yield x, Y
        finally:
            dataset = load_dataset("change data sources")
            # tried gc.collect(), tf.set_random_seed(1)


# Sets up the dataset with interleave.
def load_dataset(...):
    # setup etc
    dataset = dataset.interleave(
        lambda x: tf.data.Dataset.from_generator(
            self.simple_gen_step1,
            output_types=(tf.string, tf.float32, tf.string),
            args=(x, batch_size, lstm_reshape,)),
        cycle_length=2,
        block_length=1)
    dataset = dataset.interleave(
        lambda each_ticker, each_dataset, each_dates: tf.data.Dataset.from_generator(
            self.simple_gen_step2,
            output_types=(tf.float32, tf.int16),
            args=(names, dataset, dates, batch_size,)),
        cycle_length=2,
        block_length=1)
    return dataset
# Our model uses CuDNNLSTM and Dense layers
def build_model():
    model = Sequential()
    model.add(CuDNNLSTM(feature_count,
                        batch_input_shape=(batch_size, look_back, feature_count),
                        stateful=Settings.get_config(Settings.STATEFUL),
                        return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=True))
    model.add(CuDNNLSTM(feature_count, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 1), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 2), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(max(feature_count // (2 * 3), target_classes), use_bias=False))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(dropout))
    model.add(Dense(target_classes, activation='softmax'))
    return model
CPU RAM Shown in psutil log
For anyone running into a similar issue: I think there is a memory leak if class_weights are used with fit_generator. I have posted another question with more details.
Using class_weights in fit_generator causes memory leak
I thought I'd share what I have found regarding memory leakage in TensorFlow 2.x. It might not be a 100% specific answer to your concrete questions, but it might help others solve their memory leakage issues when using built-in functions like model.fit().
Here is a link to one of the related GitHub issues, and here is my solution (please also consider the comments on my solution).