Train on Colab TPU without data from GCP, for data that can all be loaded into memory - tensorflow

The official TPU documentation says that training files must be stored in Google Cloud Storage:
https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
However, I have a smaller dataset (1-2 GB) that can be loaded entirely into memory, although training still takes a very long time because it is based on sampling/permutations. I am wondering if I can somehow transfer the data objects to the TPU directly and have it train on them.
If it makes a difference, I am using Keras to do my TPU training.
What I looked at so far:
It seems that you can load certain data onto individual TPU cores:
self.workers = ['/job:worker/replica:0/task:0/device:TPU:' + str(i) for i in range(num_tpu_cores)]
with tf.device(self.workers[0]):
    vecs = vectors[i]
However, I am not sure if this would translate into coordinated training among all the TPU cores.

You can read files with Python:
with open(image_path, "rb") as local_file:
    img = local_file.read()
1-2 GB may be too big for the TPU. If you run out of memory, split your data into smaller portions.
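For data that genuinely fits in host memory, another option worth trying is to build a tf.data.Dataset from the in-memory arrays and train under a TPU distribution strategy, so no GCS bucket is involved. A minimal sketch, assuming Keras on TensorFlow 2.x in Colab; the array names, shapes, and the toy model are placeholders, not taken from the question:
import numpy as np
import tensorflow as tf

# Resolve and initialize the Colab TPU.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

# Hypothetical in-memory data; replace with your own arrays.
x = np.random.rand(10000, 128).astype("float32")
y = np.random.randint(0, 2, size=(10000,)).astype("int32")

# Build the input pipeline directly from memory; no files or GCS paths involved.
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(10000)
           .batch(128, drop_remainder=True)  # TPUs prefer fixed batch shapes
           .prefetch(tf.data.experimental.AUTOTUNE))

# Create and compile the model inside the strategy scope so it is replicated on the TPU cores.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(128,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

model.fit(dataset, epochs=10)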

Related

Low RAM Usage & high GPU usage, HF Datasets not helping in fine-tuning LM

I've been trying to fine-tune the "RoBERTa-base" model on the MLM task (https://huggingface.co/course/en/chapter7/3) with a sample training dataset of 20,000 points, preprocessed as needed.
I've used HuggingFace datasets (https://huggingface.co/docs/datasets/v2.6.1/en/package_reference/loading_methods#datasets.load_dataset) to read the data points, and I am also using a data collator to achieve dynamic masking for every batch.
I am using a g4dn.2xlarge instance (https://instances.vantage.sh/aws/ec2/g4dn.2xlarge - 32 GB RAM, 16 GB GPU, 8 vCPUs) for fine-tuning the RoBERTa-base MLM task with a batch size of 8, and each data point has a sequence length of 512. I am using the HuggingFace Trainer API.
With the above config, I observed that GPU memory usage was very high (95%+), while system RAM utilization was only around 13-15%.
I followed https://huggingface.co/docs/datasets/v1.12.0/cache.html and set IN_MEMORY_MAX_SIZE to ~25 GB, but no luck:
import datasets
from datasets import load_dataset

datasets.config.IN_MEMORY_MAX_SIZE = 24_696_061_952
train_dataset = load_dataset('pandas',
                             data_files={'train': 'path to pickle file'},
                             keep_in_memory=True)
But RAM usage remained the same. How can I make full use of the RAM as well as the GPU memory?
I took 20,000 points as the sample for this experiment, but I have ~1 million data points, which I will use for full-fledged training once this problem is resolved.
Any suggestions?
Huggingface discussion: https://discuss.huggingface.co/t/low-ram-usage-high-gpu-usage-datasets-not-helping/29269/1
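For reference, a minimal sketch of the setup described above; the model name, file path, masking probability, and output directory are assumptions based on the question, not confirmed details:
import datasets
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

datasets.config.IN_MEMORY_MAX_SIZE = 24_696_061_952  # ~25 GB, as in the question

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# keep_in_memory=True loads the Arrow table into RAM instead of memory-mapping it from disk.
train_dataset = load_dataset("pandas",
                             data_files={"train": "path to pickle file"},  # placeholder path
                             keep_in_memory=True)["train"]

# Dynamic masking: a new random mask is drawn every time a batch is collated.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="mlm-out", per_device_train_batch_size=8)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=collator)
trainer.train()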

Why is Google colab TPU slower than CPU or GPU? [duplicate]

Since I have a large dataset and not much power in my PC, I thought it was a good idea to use TPU on Google Colab.
So, here is my TPU configuration:
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.experimental.TPUStrategy(tpu)
else:
    strategy = tf.distribute.get_strategy()

print("REPLICAS: ", strategy.num_replicas_in_sync)
And here is my training:
hist = model.fit(train_dataset, epochs=10, verbose=1, steps_per_epoch=count_data_items(filenames)//64)
Creating a strategy is not enough; you also have to use it correctly.
You probably have to tune your pipeline, increase batch size, etc.
Have a look here: https://cloud.google.com/tpu/docs/performance-guide
Another important point is that TPU has a warm-up period — it spends a lot of time building a computation graph during the first calls (every call with a new input shape).
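One concrete way to use the strategy correctly is to build and compile the model inside strategy.scope() and scale the batch size with the number of replicas. A rough sketch; build_model and make_dataset are hypothetical placeholders for your own model and input pipeline, and the per-core batch size of 64 is just an example:
# Assumes `strategy` was created as in the snippet above.
per_replica_batch = 64                                            # example per-core batch size
global_batch = per_replica_batch * strategy.num_replicas_in_sync  # 512 on an 8-core TPU

train_dataset = (make_dataset()                                   # hypothetical tf.data pipeline
                 .batch(global_batch, drop_remainder=True)        # TPUs want fixed batch shapes
                 .prefetch(tf.data.experimental.AUTOTUNE))

# Variables must be created inside the strategy scope to be replicated on the TPU cores.
with strategy.scope():
    model = build_model()                                         # hypothetical model-building function
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

hist = model.fit(train_dataset, epochs=10, verbose=1,
                 steps_per_epoch=count_data_items(filenames) // global_batch)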
The number of TPU cores available in Colab notebooks is currently 8. Takeaways: from observing the training time, it can be seen that the TPU takes considerably more training time than the GPU when the batch size is small, but as the batch size increases the TPU performance becomes comparable to that of the GPU. Go through this link for more details.

how to work with large training set when dealing with auto-encoders on google colaboratory?

I am training an auto-encoder (Keras) on Google Colab. However, I have 25,000 input images and 25,000 output images. I tried to:
1- Copy the large file from Google Drive to Colab each time (takes 5-6 hours).
2- Convert the set to a numpy array, but when normalizing the images the size gets a lot bigger (from 7 GB to 24 GB, for example), and then I cannot fit it into RAM.
3- Zip and unzip my data, which I cannot do either.
So please, if anyone knows how to convert the images into a numpy array (and normalize them) without producing a huge file (24 GB), let me know.
What I usually do:
Zip all the images and upload the .zip file to your Google Drive.
Unzip it in your Colab:
from zipfile import ZipFile

with ZipFile('data.zip', 'r') as zip:
    zip.extractall()
All your images are now unzipped and stored on the Colab disk, so you have faster access to them.
Use generators in Keras such as flow_from_directory, or create your own generator (see the sketch after this answer).
Use your generator when you fit your model:
model.fit(train_generator, steps_per_epoch=ntrain // batch_size,
          epochs=epochs, validation_data=val_generator,
          validation_steps=nval // batch_size)
where ntrain and nval are the numbers of images in your training and validation datasets.
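A minimal sketch of such a generator pipeline with ImageDataGenerator; rescaling is done on the fly per batch, so the full normalized float array (the 24 GB problem) never has to exist in RAM. The directory name, image size, and batch size are assumptions:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# rescale normalizes each batch as it is read from disk.
datagen = ImageDataGenerator(rescale=1./255, validation_split=0.1)

train_generator = datagen.flow_from_directory(
    'images/',               # hypothetical folder produced by extractall()
    target_size=(128, 128),  # hypothetical image size
    class_mode='input',      # for an auto-encoder, the target is the input itself
    batch_size=32,
    subset='training')

val_generator = datagen.flow_from_directory(
    'images/',
    target_size=(128, 128),
    class_mode='input',
    batch_size=32,
    subset='validation')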

Tensorflow ResNet model loading uses ~5 GB of RAM - while loading from weights uses only ~200 MB

I trained a ResNet50 model using Tensorflow 2.0 by transfer learning. I slightly modified the architecture (new classification layer) and saved the model with the ModelCheckpoint callback https://keras.io/callbacks/#modelcheckpoint during training. Training was fine. The model saved by callback takes ~206 MB on the hard drive.
To predict using the model I did:
I started a Jupyter Lab notebook. I used my_model = tf.keras.models.load_model('../models_using/my_model.hdf5') for loading the model. (btw, the same occurs using IPython).
I used the free Linux command-line tool to measure the free RAM just before and after loading. Loading the model takes about 5 GB of RAM.
I saved the weights of the model and the config as json. This takes about 105 MB.
I loaded the model from the json config and weights. This takes about ~200 MB of RAM.
Compared the predictions of both models. Exactly the same.
I tested the same procedure with a slightly different architecture (trained the same way) and the results were the same.
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
Btw, given a model in Keras, can you find out the compilation procedure (optimizer, ...)? Model.summary() does not help.
2019-12-07 - EDIT: Thanks to this answer, I conducted a series of tests:
I used the !free command in JupyterLab to measure the available memory before and after each test. Since get_weights returns a list, I used copy.deepcopy to really copy the objects. Note, the commands below were separate Jupyter cells and the memory comments were added just for this answer.
!free
model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786.736 MB used
!free
weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51.616 MB used
!free
optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100.868 MB used
!free
model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199.228 MB used
!free
Loading the model from json:
!free
# loading from json
with open('model_json.json') as f:
    model_json_weights = tf.keras.models.model_from_json(f.read())

model_json_weights.load_weights('model_weights.h5')
!free
# 21132664 - 20971616 = 161.048 MB used
The difference between checkpoint and JSON+Weights is in the optimizer:
The checkpoint or model.save() saves the optimizer and its weights (load_model compiles the model).
JSON + weights doesn't save the optimizer.
Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a tensor of "momentum" for each weight tensor, for instance).
Some optimizers might take two times the size of the model, because they keep two tensors of optimizer weights for each tensor of model weights.
Saving and loading the optimizer is important if you want to continue training. Starting training again with a new optimizer without proper weights will sort of destroy the model's performance (at least in the beginning).
Now, the 5GB is not really clear to me. But I suppose that:
There should be a lot of compression in saved weights
It might have to do with also allocating memory for all the gradient and backpropagation operations
Interesting tests:
Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These weights will be numpy, copied from the original tensors
Gradient/Backpropagation: check how much memory is used by:
load_model(name, compile=True)
load_model(name, compile=False)
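For the compression test, one straightforward measurement (a sketch, assuming the model was loaded with compile=True so that optimizer weights exist) is to sum the sizes of the numpy arrays returned by get_weights():
def total_mb(weight_list):
    # get_weights() returns numpy arrays, so nbytes gives their in-memory size.
    return sum(w.nbytes for w in weight_list) / 1024**2

print("model weights:     %.1f MB" % total_mb(model.get_weights()))
print("optimizer weights: %.1f MB" % total_mb(model.optimizer.get_weights()))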
I had a very similar issue with a recent model that I was attempting to load. Albeit this is a year-old question, so I am uncertain whether my solution will work for you. However, while reading through the Keras documentation on saved models,
I found this piece of code to be very useful:
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
It turns out that in my case the loaded model was using up all of the GPU memory and causing issues, so this setting forces it to grow GPU memory allocation only as needed, or at least that is my conclusion.

How to fix low volatile GPU-Util with Tensorflow-GPU and Keras?

I have a 4 GPU machine on which I run Tensorflow (GPU) with Keras. Some of my classification problems take several hours to complete.
nvidia-smi returns Volatile GPU-Util which never exceeds 25% on any of my 4 GPUs.
How can I increase GPU Util% and speed up my training?
If your GPU util is below 80%, this is generally the sign of an input pipeline bottleneck. It means the GPU sits idle much of the time, waiting for the CPU to prepare the data.
What you want is for the CPU to keep preparing batches while the GPU is training, so the GPU stays fed. This is called prefetching.
Great, but if the batch preparation is still much longer than the model training, the GPU will still sit idle, waiting for the CPU to finish the next batch. To make batch preparation faster we can parallelize the different preprocessing operations.
We can go even further by parallelizing I/O.
Now to implement this in Keras, you need to use the Tensorflow Data API with Tensorflow version >= 1.9.0. Here is an example:
Let's assume, for the sake of this example that you have two numpy arrays x and y. You can use tf.data for any type of data but this is simpler to understand.
def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(None)  # Will automatically prefetch batches
....
model = tf.keras.Model(...)
model.fit(x=dataset)  # Since tf 1.9.0 you can pass a dataset object
tf.data is very flexible, but like everything in Tensorflow (except eager mode), it uses a static graph. This can be a pain sometimes, but the speedup is worth it.
To go further, you can have a look at the performance guide and the Tensorflow data guide.
I had a similar issue: the memory of all the GPUs was allocated by Keras, but Volatile GPU-Util was around 0% and training was taking almost the same amount of time as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to all available CPUs, the training time dropped rapidly.
You can also load the data into memory and then use the flow method to prepare batches with augmented images; see the sketch below.
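A minimal sketch of both suggestions; the folder name, image size, batch size, and the already-compiled model are assumptions, and x_train / y_train stand in for your in-memory arrays:
import multiprocessing
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255, horizontal_flip=True)

# Option 1: read images from disk, but let several workers prepare batches in parallel.
train_generator = datagen.flow_from_directory('train/',             # hypothetical folder
                                               target_size=(224, 224),
                                               batch_size=64)
model.fit_generator(train_generator,
                    steps_per_epoch=len(train_generator),
                    epochs=10,
                    workers=multiprocessing.cpu_count())             # default is 1

# Option 2: data already in memory as numpy arrays; flow() yields augmented batches.
model.fit_generator(datagen.flow(x_train, y_train, batch_size=64),
                    steps_per_epoch=len(x_train) // 64,
                    epochs=10,
                    workers=multiprocessing.cpu_count())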