TensorFlow ResNet model loading uses ~5 GB of RAM, while loading from weights uses only ~200 MB

I trained a ResNet50 model using TensorFlow 2.0 by transfer learning. I slightly modified the architecture (new classification layer) and saved the model with the ModelCheckpoint callback https://keras.io/callbacks/#modelcheckpoint during training. Training was fine. The model saved by the callback takes ~206 MB on the hard drive.
To predict using the model I did:
I started a Jupyter Lab notebook and loaded the model with my_model = tf.keras.models.load_model('../models_using/my_model.hdf5'). (The same happens in IPython.)
I used the Linux command-line tool free to measure the free RAM just before and just after loading. Loading the model takes about 5 GB of RAM.
I saved the weights of the model and its config as JSON. Together these take about 105 MB on disk.
I loaded the model from the JSON config and the weights. This takes about 200 MB of RAM.
I compared the predictions of both models. They are exactly the same.
I tested the same procedure with a slightly different architecture (trained the same way) and the results were the same.
Can anyone explain the huge RAM usage, and the difference in size of the models on the hard drive?
Btw, given a model in Keras, can you find out how it was compiled (optimizer, ...)? model.summary() does not help.
2019-12-07 - EDIT: Thanks to this answer, I conducted a series of tests:
I used the !free command in JupyterLab to measure the available memory before and after each test. Since get_weights returns a list, I used copy.deepcopy to really copy the objects. Note that the commands below were run in separate Jupyter cells, and the memory comments were added just for this post.
!free
model = tf.keras.models.load_model('model.hdf5', compile=True)
# 25278624 - 21491888 = 3786.736 MB used
!free
weights = copy.deepcopy(model.get_weights())
# 21491888 - 21440272 = 51.616 MB used
!free
optimizer_weights = copy.deepcopy(model.optimizer.get_weights())
# 21440272 - 21339404 = 100.868 MB used
!free
model2 = tf.keras.models.load_model('model.hdf5', compile=False)
# 21339404 - 21140176 = 199.228 MB used
!free
Loading the model from json:
!free
# loading from json
with open('model_json.json') as f:
    model_json_weights = tf.keras.models.model_from_json(f.read())
model_json_weights.load_weights('model_weights.h5')
!free
# 21132664 - 20971616 = 161.048 MB used
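The subtraction in the comments above can be reproduced with a small helper. This is only a sketch: it assumes the figures are readings from one column of the free output, printed in kilobytes, and it divides by 1000 to match the MB values quoted above. The name used_mb is made up for illustration.

```python
def used_mb(before_kb, after_kb):
    """Memory consumed between two `free` readings, in MB (1 MB = 1000 kB here)."""
    return (before_kb - after_kb) / 1000

# (label, reading before, reading after), copied from the cells above.
steps = [
    ("load_model(compile=True)", 25278624, 21491888),
    ("deepcopy(model.get_weights())", 21491888, 21440272),
    ("deepcopy(model.optimizer.get_weights())", 21440272, 21339404),
    ("load_model(compile=False)", 21339404, 21140176),
    ("model_from_json + load_weights", 21132664, 20971616),
]

for label, before, after in steps:
    print(f"{label}: {used_mb(before, after):.3f} MB")
```

The compile=True load is the only step that costs gigabytes; every other step stays at or below about 200 MB.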

The difference between the checkpoint and JSON + weights is the optimizer:
The checkpoint (or model.save()) saves the optimizer and its weights, and load_model compiles the model.
JSON + weights doesn't save the optimizer.
Unless you are using a very simple optimizer, it's normal for it to have about the same number of weights as the model (a "momentum" tensor for each weight tensor, for instance).
Some optimizers might take twice the size of the model, because they keep two tensors of optimizer weights for each tensor of model weights.
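As a rough check of this explanation against the file sizes in the question, the arithmetic below may help. It is a sketch that assumes float32 storage (4 bytes per value) and ignores HDF5 overhead; slots_per_weight is a hypothetical knob for how many optimizer tensors accompany each weight tensor.

```python
BYTES_PER_FLOAT32 = 4  # assumption: weights are stored as float32

def checkpoint_mb(weights_mb, slots_per_weight):
    """Estimated size of model weights plus optimizer state, in MB."""
    return weights_mb * (1 + slots_per_weight)

weights_mb = 105  # size of the weights-only file reported in the question
approx_params = weights_mb * 1e6 / BYTES_PER_FLOAT32

print(f"approx. parameters:   {approx_params / 1e6:.2f} million")
print(f"one slot per weight:  {checkpoint_mb(weights_mb, 1)} MB")
print(f"two slots per weight: {checkpoint_mb(weights_mb, 2)} MB")
```

With one optimizer tensor per weight tensor the estimate is 210 MB, which is close to the ~206 MB checkpoint. Which optimizer was actually used is not stated in the question, so this is only a plausibility check.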
Saving and loading the optimizer is important if you want to continue training. Starting training again with a new optimizer without proper weights will sort of destroy the model's performance (at least in the beginning).
Now, the 5 GB is not really clear to me. But I suppose that:
There should be a lot of compression in the saved weights
It might also have to do with allocating memory for all the gradient and backpropagation operations
Interesting tests:
Compression: check how much memory is used by the results of model.get_weights() and model.optimizer.get_weights(). These weights will be NumPy arrays, copied from the original tensors.
Gradient/backpropagation: check how much memory is used by:
load_model(name, compile=True)
load_model(name, compile=False)
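A sketch of the first test: sum the byte sizes of the returned arrays. The helper name total_mb is made up here, and it only assumes that each element exposes an .nbytes attribute, as NumPy arrays do. The stand-in buffers replace real weights so that the snippet runs without a model.

```python
def total_mb(arrays):
    """Total in-memory size of array-like objects exposing .nbytes, in MB."""
    return sum(a.nbytes for a in arrays) / 1e6

# Stand-ins for model.get_weights(): two buffers sized like float32 arrays
# with 1,000,000 and 500,000 values (4 bytes per value).
fake_weights = [
    memoryview(bytearray(4 * 1_000_000)),
    memoryview(bytearray(4 * 500_000)),
]

print(total_mb(fake_weights))  # 6.0
```

With a loaded model, the same helper would be called as total_mb(model.get_weights()) and total_mb(model.optimizer.get_weights()).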

I had a very similar issue with a model that I was recently trying to load. This is a year-old question, so I am uncertain whether my solution will work for you. However, while reading through the Keras documentation on saved models,
I found this piece of code to be very useful:
physical_devices = tf.config.list_physical_devices('GPU')
for device in physical_devices:
    tf.config.experimental.set_memory_growth(device, True)
It turns out that in my case the loaded model was using up all of the GPU memory and causing issues. Enabling memory growth makes TensorFlow allocate GPU memory as it is needed instead of reserving it all up front, or at least that is my conclusion.

Related

Functioning CNN model written in Keras 2.2.4 not learning in TensorFlow or Keras 2.4

I am dealing with an object detection problem and using a model which is actually functioning (its results have been published on a paper and I have the original code). Originally, the code was written with Keras 2.2.4 without importing TensorFlow and trained and tested on the same dataset that I am using at the moment. However, when I try to run the same model with TensorFlow 2.x it just won't learn a thing.
I have tried importing everything from TensorFlow 2.4, and I get the same problem if I import everything (layers, models, optimizers, ...) from Keras 2.4. I have tried this on two different devices, both using a GPU. What happens is that the loss decreases ridiculously fast, but the accuracy doesn't increase at all (or, if it does, it gets stuck around 10% or so). Also, every now and then this happens from one epoch to the next:
Loss undergoes HUGE jumps between consecutive epochs, and all this without any changes in accuracy
I have tried to train the network on another dataset (had to change the last layers in order to match the required dimensions) and the model seemed to be learning in a normal way, i.e. the accuracy actually increases and the loss doesn't reach 0.0x in one epoch.
I can't post the script, but the model is an encoder-decoder network: consecutive convolutions with an increasing number of filters reduce the dimensions of the image, and a mirrored path of transposed convolutions restores the original dimensions. So basically the network only contains:
1. Conv2D
2. Conv2DTranspose
3. BatchNormalization
4. Activation("relu")
5. Activation("sigmoid")
6. concatenate
6 is used to put together outputs from parallel paths or distant layers; 3 and 4 are used after every Conv or ConvTranspose; 5 is only used as final activation function, i.e. as output layer.
I think the problem is pretty generic and I am honestly surprised that I couldn't find a single question about it. What could be happening here? The problem must have something to do with the TF/Keras versions, but I can't find any documentation about it, and I have tried changing many things without any effect. It's crazy, because if I didn't know that the model works I would try to rewrite it from scratch. So I am afraid that this problem may occur with a new network and I won't be able to tell whether it's the libraries or the model itself.
Thank you in advance! :)
EDIT
Code snippets:
Convolutional block:
encoder1 = Conv2D(filters=first_layer_channels, kernel_size=2, strides=2)(input)
encoder1 = BatchNormalization()(encoder1)
encoder1 = Activation('relu')(encoder1)
Decoder:
decoder1 = Conv2DTranspose(filters=first_layer_channels, kernel_size=2, strides=2)(encoder4)
decoder1 = BatchNormalization()(decoder1)
decoder1 = Activation('relu')(decoder1)
Final layers:
final = Conv2D(filters=total, kernel_size=1)(decoder4)
final = BatchNormalization()(final)
Last_Conv = Activation('sigmoid')(final)
The task is human pose estimation: the network (which, I recall, works on this specific task with Keras 2.2.4) has to predict twenty binary maps containing the positions of specific keypoints.
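The shape bookkeeping of such an encoder-decoder can be checked without Keras. The sketch below assumes 'valid' padding (the Keras default), four encoder and four decoder blocks (inferred from the names encoder4 and decoder4), and a hypothetical input side of 256 pixels. The helper names are made up for illustration.

```python
def conv_out(n, kernel, stride):
    """Output length of a 'valid' convolution along one axis."""
    return (n - kernel) // stride + 1

def conv_transpose_out(n, kernel, stride):
    """Output length of a 'valid' transposed convolution along one axis."""
    return n * stride + max(kernel - stride, 0)

side = 256  # hypothetical input height/width
sizes = [side]
for _ in range(4):  # encoder: kernel_size=2, strides=2 halves each side
    sizes.append(conv_out(sizes[-1], kernel=2, stride=2))
for _ in range(4):  # decoder: kernel_size=2, strides=2 doubles each side
    sizes.append(conv_transpose_out(sizes[-1], kernel=2, stride=2))

print(sizes)  # [256, 128, 64, 32, 16, 32, 64, 128, 256]
```

Under these assumptions the decoder restores the input size exactly, so a shape mismatch is unlikely to be what differs between the two library versions.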

Set batch size of trained keras model to 1

I have a Keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix the shape to a batch size of 1, because I need it fixed in order to convert the model to TFLite with GPU support.
What worked for me was to specify batch size to the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
The bad news is that despite this "fix" the TFLite runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0
tensorflow-lite-metadata:0.2.1
tensorflow-lite:2.6.0
tensorflow-lite-gpu:2.3.0
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

Train on colab TPU without data from GCP, for data that can be all loaded into memory

The official TPU documentation says that training files must be on GCP:
https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
But I have a small dataset that can be loaded entirely into memory (1-2 GB); training would still take a very long time because it is based on sampling/permutations. I am wondering if I can somehow transfer the data objects to the TPU directly, so that it can use them to train the model.
If it makes a difference, I am using Keras to do my TPU training.
What I looked at so far:
It seems that you can load certain data onto individual TPU cores:
self.workers = ['/job:worker/replica:0/task:0/device:TPU:' + str(i) for i in range(num_tpu_cores)]
with tf.device(self.workers[0]):
    vecs = vectors[i]
However, I am not sure if this would translate into coordinated training among all the TPU cores.
You can read files with Python:
with open(image_path, "rb") as local_file:
    img = local_file.read()
1-2 GB may be too big for the TPU. If you run out of memory, split your data into smaller portions.
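Splitting in-memory data into smaller portions can be sketched as below. The helper name portions and the portion size are hypothetical, and how each portion is then fed to the TPU is not covered here.

```python
def portions(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]

data = list(range(10))  # stand-in for samples already loaded in memory
for part in portions(data, 4):
    print(part)
```

Because the helper is a generator, only one portion is materialised at a time on top of the original data.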

reduce size of pretrained deep learning model for feature generation

I am using a pretrained model in Keras to generate features for a set of images:
model = InceptionV3(weights='imagenet', include_top=False)
train_data = model.predict(data).reshape(data.shape[0],-1)
However, I have a lot of images, and the ImageNet model outputs 131072 features (columns) for each image.
With 200k images I would get an array of shape (200000, 131072), which is too large to fit into memory.
More importantly, I need to save this array to disk, and it would take about 100 GB of space when saved as .npy or .h5.
I could work around the memory problem by feeding batches of about 1000 images and saving each batch to disk, but that does not solve the disk space problem.
How can I make the model output smaller without losing too much information?
update
As the answer suggested, I included the next layer in the model as well:
base_model = InceptionV3(weights='imagenet')
model = Model(inputs=base_model.input, outputs=base_model.get_layer('avg_pool').output)
This reduced the output to (200000, 2048).
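The sizes mentioned above follow from simple arithmetic. The sketch assumes float32 features (4 bytes each) and that the 131072 columns come from flattening an 8 x 8 x 2048 feature map; the helper name array_gb is made up for illustration.

```python
def array_gb(rows, cols, bytes_per_value=4):
    """Size of a dense rows x cols array in GB (assumes float32 by default)."""
    return rows * cols * bytes_per_value / 1e9

# 131072 features per image is consistent with an 8 x 8 x 2048 feature map.
assert 8 * 8 * 2048 == 131072

print(f"without pooling: {array_gb(200_000, 131_072):.2f} GB")
print(f"after avg_pool:  {array_gb(200_000, 2_048):.2f} GB")
```

Under these assumptions the pooled features take roughly 1.6 GB instead of roughly 105 GB, a 64-fold reduction.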
update 2:
Another interesting option may be the bcolz package, which reduces the size of NumPy arrays: https://github.com/Blosc/bcolz
I see at least two solutions to your problem:
Apply model = AveragePooling2D((8, 8), strides=(8, 8))(model), where model is the InceptionV3 object you loaded (without the top). This is the next step in the InceptionV3 architecture, so one may reasonably assume that these features still hold plenty of discriminative information.
Apply some kind of dimensionality reduction (e.g. PCA) fitted on a sample of the data, and reduce the dimensionality of all the data to get a reasonable file size.

GPU + CPU Tensorflow Training

Setup
I have a network, one of whose parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors from embed_mat (max 15000 vectors) which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store embed_mat on the GPU due to its size, I define it under a CPU device (say /cpu:0), but the rest of the model's parameters, the optimizer, etc. are defined under a GPU device (say /gpu:0).
Questions
I see that my GPU memory usage is minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup would be copied to the GPU and all my training would happen there. Am I doing something wrong?
The training time is affected very strongly by the size (num_vectors) of the embedding matrix, which doesn't seem correct to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if at all, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split up my embed_mat to multiple GPUs for faster training?
I suspect the Adam optimizer is responsible. It looks like, because embed_mat is on the CPU, all training is happening on the CPU. Is this correct?
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.
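Some back-of-the-envelope numbers may help when reading the device placement. The sketch assumes float32 values and that Adam keeps two extra tensors per variable (its first- and second-moment estimates). Whether the optimizer updates those tensors for every row on every step depends on the implementation and is not confirmed by the post.

```python
BYTES_PER_VALUE = 4  # assumption: float32

rows, dim = 3_000_000, 300  # embedding matrix shape from the question
touched_rows = 15_000       # maximum rows looked up per mini-batch

embed_gb = rows * dim * BYTES_PER_VALUE / 1e9
adam_slots_gb = 2 * embed_gb           # two moment tensors of the same shape
touched_fraction = touched_rows / rows

print(f"embedding matrix:      {embed_gb:.1f} GB")
print(f"Adam moment tensors:   {adam_slots_gb:.1f} GB")
print(f"rows touched per step: {touched_fraction:.1%}")
```

If the moment tensors were updated densely, each step would touch all 900 million values rather than the 4.5 million that were looked up, which would make step time scale with the size of the matrix.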