Why does a JAX + STAX model take more GPU memory than needed?

I'm trying to run a JAX + STAX model in a Kaggle kernel on GPU, but it fails with an out-of-memory (OOM) error. I've set XLA_PYTHON_CLIENT_PREALLOCATE to false to avoid preallocation of GPU memory, and I've also tried setting XLA_PYTHON_CLIENT_ALLOCATOR to platform; neither helped. The default device is set to CPU from the beginning because I do not want all of the data stored on the GPU; the model parameters and each batch of data are sent to the GPU manually. The size of the variables (model parameters, data, ...) shouldn't be a problem, as the same code runs smoothly on CPU without OOM errors.

I've also done device memory profiling of the model. To capture GPU memory at all, I had to make another version of the code where the GPU is the default device and all the data is stored there; if I run the profiling on the original code where CPU is the default, I only get the profiling for the CPU data. Reducing the batch size to 10 was also necessary for the model to complete training. The profiling shows only the memory needed for storing the data and parameters (≈ 5.5 GB), but when I check the GPU usage with other Python tools it is much larger (≈ 14.6 GB). Note: when run with batch_size = 100, the memory also hits 14.6 GB during the first mini-batch but cannot go further.
Here is a simplified version of the code I used:
import os
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = 'false'
# os.environ["XLA_PYTHON_CLIENT_ALLOCATOR"] = 'platform'  # Tried this, didn't help

import jax
import jax.numpy as jnp
import optax
from jax import random, value_and_grad, jit
from jax.lib import xla_bridge
from jax.example_libraries import stax  # jax.experimental.stax in older JAX versions
from jax.example_libraries.stax import Conv, Dense, Flatten, Relu

# If the default device is not set to CPU, all device arrays are placed on the GPU by default
jax.config.update('jax_platform_name', 'cpu')

# Use the GPU if one is available
try:
    print('Available GPU Devices: ', jax.devices("gpu"))
    device = jax.devices("gpu")[0]
    gpu_available = 1
except RuntimeError:
    device = jax.devices("cpu")[0]
    gpu_available = 0

# Load data into JAX device arrays of shape (2000, 200, 200, 3)...
# (train_set, image_width, image_height and number_of_channels come from this step)

InitializationFunction, ApplyFunction = stax.serial(
    Conv(out_chan=64, filter_shape=(3, 3), strides=(1, 1), padding='SAME'), Relu,
    Conv(out_chan=64, filter_shape=(3, 3), strides=(1, 1), padding='SAME'), Relu,
    Flatten, Dense(128), Relu, Dense(2),
)

key = random.PRNGKey(2793)
output_shape, parameters = jax.device_put(
    InitializationFunction(rng=key, input_shape=(100, image_width, image_height, number_of_channels)),
    device)

optimizer = optax.adam(0.001)
optimizer_state = jax.device_put(optimizer.init(parameters), device)

def Loss(parameters, inputs, targets):
    predictions = ApplyFunction(parameters, inputs)
    loss = jnp.mean(optax.softmax_cross_entropy(predictions, targets))
    return loss

@jit
def Step(parameters, optimizer_state, inputs, targets):
    loss, gradients = value_and_grad(Loss)(parameters, inputs, targets)
    updates, optimizer_state = optimizer.update(gradients, optimizer_state, parameters)
    parameters = optax.apply_updates(parameters, updates)
    return parameters, optimizer_state, loss

epochs, batch_size = 2, 100
key, subkey = random.split(key)
keys_epochs = random.split(subkey, epochs)

for epoch in range(epochs):
    random_indices_order = random.permutation(keys_epochs[epoch], jnp.arange(len(train_set['images'])))
    for batch_number in range(len(train_set['images']) // batch_size):
        start = batch_number * batch_size
        end = (batch_number + 1) * batch_size
        batch_inputs = jax.device_put(jnp.take(train_set['images'], random_indices_order[start:end], 0), device)
        # OneHot is the poster's own one-hot encoding helper
        batch_targets = jax.device_put(
            OneHot(jnp.take(train_set['class_numbers'], random_indices_order[start:end], 0),
                   jnp.max(train_set['class_numbers']) + 1),
            device)
        parameters, optimizer_state, loss = Step(parameters, optimizer_state,
                                                 inputs=batch_inputs, targets=batch_targets)
My questions are:
Why is more GPU memory used than needed for the size of the variables, and more than is captured by JAX device memory profiling? What is the excess memory used for, how can I track it, and how can I prevent it?
How can I capture both CPU and GPU memory when doing JAX device memory profiling? It only captures CPU memory when the CPU is the default device, even though the GPU is available and in use too.
Here is the result of device memory profiling for the GPU when the GPU is set as the default device and stores the entire dataset (2 × (2000, 200, 200, 3) ≈ 1.79 GB), with the batch size reduced to 10.
GPU Jax Device Memory profiling for batch size 10
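For tracking the GPU side directly while the CPU stays the default device, here is a minimal sketch. jax.profiler.save_device_memory_profile is documented JAX; the live_buffers() call on the backend object is an assumption that holds for recent jaxlib builds and may differ in other versions.
import numpy as np
import jax
import jax.profiler

# Dump a pprof-format snapshot of live device buffers (view with `pprof`);
# in practice this seems to follow the default backend, hence the GPU-default variant of the script.
jax.profiler.save_device_memory_profile("memory_gpu.prof")

# Assumption: recent jaxlib backends expose live_buffers(); sum what XLA currently holds on the GPU.
gpu_backend = jax.lib.xla_bridge.get_backend("gpu")
live = gpu_backend.live_buffers()
total_bytes = sum(int(np.prod(buf.shape)) * buf.dtype.itemsize for buf in live)
print(f"{len(live)} live GPU buffers ≈ {total_bytes / 1e9:.2f} GB")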

Related

How do I distribute datasets between multiple GPUs in Tensorflow 2?

I'm trying to understand how to use multiple GPUs to train a model on data too large for GPU memory. Using tf.distribute.MirroredStrategy seems to copy the full dataset to each GPU. What I'm hoping to do is to send a subset of the full dataset to each GPU (2 or 4 GPUs) and use MirroredStrategy to reconcile parameter updates each epoch.
MirroredStrategy.distribute_datasets_from_function() looks promising.
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#distribute_datasets_from_function
Problem details:
A fairly complicated multimodal NN with ~200k parameters, synthesizing many text, transactional, and structured inputs, with multiple regression and probabilistic outputs. I'm looking at moving development from a single GPU with 24 GB of memory to cloud compute with multiple 16 GB cards on a single node.
The inputs and targets are currently dictionaries of numpy arrays. I'm hoping for a toy example that converts those dictionaries into a distributed dataset and goes through to training, with a different subset of the full dataset assigned to each GPU.
I attempted this:
def build_model(**model_params):
    '''
    Builds a model from model_params
    '''
    return tf.keras.Model(
        inputs=[MY_INPUT_TENSORS],
        outputs=[MY_OUTPUT_TENSORS])

distributed_strategy = tf.distribute.MirroredStrategy()
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(...)

train_model.fit(X_dict, y_dict)
This runs on a 50% sample of the data, but returns OOM on the full sample on 2 GPUs. The full dataset appears to be copied to each of the two 16 GB GPUs available. The same model runs with a 100% sample on a single 24 GB GPU.
Here's how I got it working with tf.data.Dataset.from_tensor_slices() and tf.distribute.MirroredStrategy.experimental_distribute_dataset():
# Data exists in the form of dictionaries of large numpy arrays
x_train, y_train, x_validation, y_validation = {}, {}, {}, {}

# Create tensorflow datasets using CPU / system memory
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    valid = tf.data.Dataset.from_tensor_slices((x_validation, y_validation))

batch_size = 1024
epochs = 30

distributed_strategy = tf.distribute.MirroredStrategy()
num_gpu = distributed_strategy.num_replicas_in_sync

# Create a distributed dataset from the tensorflow datasets.
# The data gets streamed to the GPUs, so shuffling, repetition per epoch, and batch
# size need to be manually specified
train = train.shuffle(100 * batch_size).repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
train_dist = distributed_strategy.experimental_distribute_dataset(train)
valid = valid.repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)

# Build and compile the model
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=losses,
        loss_weights=weights)

# Train the model. steps_per_epoch and validation_steps need to be specified.
train_model.fit(
    train_dist,
    validation_data=valid,
    epochs=epochs,
    steps_per_epoch=int(len(train) // epochs),
    validation_steps=int(len(valid) // epochs),
    use_multiprocessing=True,
    verbose=1,
)
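The question also points at MirroredStrategy.distribute_datasets_from_function(). A rough sketch of that route (the dataset_fn body and the shard/shuffle choices below are my assumptions, not taken from the post; in older TF 2.x releases the method is named experimental_distribute_datasets_from_function):
def dataset_fn(input_context: tf.distribute.InputContext):
    # Each input pipeline builds only its own shard, so the full dataset is not
    # replicated onto every GPU; batch with the per-replica batch size.
    per_replica_batch = input_context.get_per_replica_batch_size(
        global_batch_size=num_gpu * batch_size)
    dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    dataset = dataset.shard(input_context.num_input_pipelines,
                            input_context.input_pipeline_id)
    return dataset.shuffle(10 * per_replica_batch).batch(per_replica_batch).prefetch(2)

train_dist = distributed_strategy.distribute_datasets_from_function(dataset_fn)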

apply dataset.repeat of prefetched dataset

I'm trying to implement AdaRound quantization algorithm and I need to train my layers one by one.
I'm using a dataset of 1024 samples with a batch size of 32, and I need to iterate over the dataset for roughly 312 epochs (about 10k iterations over the batched dataset). I've noticed that the data is copied from the host to the device on every iteration and is not cached on the GPU (despite the same data being used repeatedly); as a result, the GPU is idle 30-40% of the time.
Idle GPU percentage
The data is still copied from the host to the device on later iterations:
memcpyH2D chunk in single iteration
I've tried using
tf.data.experimental.prefetch_to_device
tf.data.experimental.copy_to_device
When I iterate over the data after prefetch_to_device or copy_to_device, the tensors are stored on the GPU, but if I use repeat to go over the dataset, the tensors end up back on the CPU.
I tried using model.fit without dataset.repeat but with multiple epochs, and I get similar behavior.
I also tried using model.fit with tensors that are stored on the GPU, but Model.fit converts them to a Dataset, which forces the data back to the CPU.
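One thing the tf.data documentation points at (a sketch, not verified against this setup): prefetch_to_device is expected to be the final transformation in the pipeline, so cache/repeat/batch have to come before it. Roughly, with placeholder tensors standing in for the real data:
import tensorflow as tf

# Placeholder dataset standing in for the zipped (input, target) dataset below.
dataset = tf.data.Dataset.from_tensor_slices((tf.zeros([8, 56, 56, 64]),
                                              tf.zeros([8, 54, 54, 64])))
on_gpu = (dataset
          .batch(32)
          .cache()
          .repeat()
          .apply(tf.data.experimental.prefetch_to_device('/gpu:0')))  # last transformation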
A code snippet to recreate the issue:
import numpy as np
import tensorflow as tf

input_shape = (56, 56, 64)
output_shape = (54, 54, 64)

conv = tf.keras.layers.Conv2D(64, (3, 3))
mock_input = tf.keras.layers.Input(input_shape)
mock_output = conv(mock_input)
train_model = tf.keras.Model(inputs=mock_input, outputs=mock_output)

input_data = np.random.rand(1024, *input_shape)
output_data = np.random.rand(1024, *output_shape)
input_dataset = tf.data.Dataset.from_tensor_slices(input_data)
output_dataset = tf.data.Dataset.from_tensor_slices(output_data)

train_model.compile(
    optimizer='adam',
    loss='mse'
)

train_data = tf.data.Dataset.zip((input_dataset, output_dataset))
batched_train_data = train_data.batch(32).cache()
fetched_train_data = batched_train_data.prefetch(tf.data.AUTOTUNE).repeat()

with tf.profiler.experimental.Profile('logs'):
    train_model.fit(fetched_train_data, steps_per_epoch=1024, epochs=1)
Is there a way to apply the dataset.repeat operation on the GPU?
I'm using tensorflow 2.5.2 with python 3.6.9
Detailed answer
NVIDIA has a package named NVIDIA DALI. This package offers an efficient wrapper around TensorFlow's dataset API (and more, but that is the relevant feature I used here). I had to install two packages, nvidia-dali-cuda110 and nvidia-dali-tf-plugin-cuda110 (a detailed installation guide can be found here).
The class I used is called DALIDataset; to instantiate it properly I first had to initialize a pipeline object.
single iteration when using properly initialized DALIDataset
Code snippet:
import tensorflow as tf
from nvidia.dali.plugin.tf import DALIDataset
from nvidia.dali import pipeline_def, fn

def prep_dataset_dali(dir1, dir2, batch_size):
    @pipeline_def(batch_size=batch_size, num_threads=3, device_id=0)
    def pipe(path1, path2):
        data1 = fn.readers.numpy(device='cpu', file_root=path1, file_filter='*.npy')
        data2 = fn.readers.numpy(device='cpu', file_root=path2, file_filter='*.npy')
        return data1.gpu(), data2.gpu()

    my_pipe = pipe(dir1, dir2)
    my_pipe.build()
    return DALIDataset(my_pipe,
                       output_dtypes=(tf.float32, tf.float32),
                       output_shapes=((None, 56, 56, 64), (None, 56, 56, 64)))
Note:
An external pipeline doesn't work with DALIDataset, but it might work with the DALIDatasetWithInputs class from the experimental section.
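For context, a short usage sketch of the helper above; the directory paths, step count, and epoch count are placeholders I made up, not values from the post:
# Hypothetical paths; prep_dataset_dali is the helper defined above,
# and DALIDataset behaves like a tf.data.Dataset, so it can be passed to fit().
train_data = prep_dataset_dali('data/inputs', 'data/targets', batch_size=32)

train_model.compile(optimizer='adam', loss='mse')
train_model.fit(train_data, steps_per_epoch=32, epochs=312)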

Memory leak with TPUs on GKE causing OOM/"Unavailable: Socket Closed" error

I am using preemptible v2-8 Google Cloud TPUs to perform large-scale hyperparameter optimization. I created the nodes using GKE with TensorFlow 2.3 (the latest version available for Cloud TPUs). Unfortunately, I keep encountering a memory leak on the TPU nodes during the search. This memory leak seems to ultimately cause an "Unavailable: Socket Closed" error (or sometimes an OOM error), after which the TPU becomes unable to perform any additional training or evaluation, even after re-deploying the code. The problem does not occur when I test my code on either a CPU or GPU.
This problem only occurs on the TPU worker node, not on the controller CPU. (At one point, I had been encountering another memory leak on the CPU due to a buildup of old models and unnecessary operations on the computation graph.) Methods such as tf.keras.backend.clear_session() and del model resolved the memory leak on the CPU, but it persists on the TPU. Here is a graph of the TPU runtime memory usage (the decrease in memory at the end appears to occur after the TPU disconnects, because GKE deletes it automatically):
Ultimately, as the used memory increases on the TPU, I get the following error:
raise_from tensorflow.python.framework.errors_impl.ResourceExhaustedError: 9 root error(s) found.
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
[[{{node cluster_train_function/_execute_4_0}}]]
0 successful operations.
0 derived errors ignored.
Occasionally, I instead get an "Unavailable: Socket Closed" error or an "Unable to destroy remote tensor handles" error.
This error typically only occurs after training several networks. I tried multiple methods suggested by other posts to fix it, such as casting my data to float32, not caching my dataset into memory, using a smaller mini-batch size to decrease memory consumption, and using from_logits=True in my cost function. I even tried using multiprocessing to perform the network training so that memory would be cleared after each network evaluation, but for some reason the Cloud TPU fails to execute any of the for loops in my code or in the training code (a problem I did not have with either a GPU or CPU, cloud or otherwise). Larger networks seem to cause the problem to occur much more quickly than smaller networks, which suggests to me that old, unused models are still kept in memory on the TPU. Is there any way to clear the memory on the TPU or reset its state to stop this memory leak?
Here is an MVE I wrote to duplicate the problem:
import os
import gc
import sys
import random
import numpy as np
import tensorflow as tf
from sklearn import metrics
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Conv2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

h = 128
w = 128
channels = 1
mini_batch_size = 256
epochs = 15
using_tpu = True

if using_tpu:
    ## Get tpu name from arguments
    tpu_name = sys.argv[1]
    tpu_name = tpu_name.replace('--tpu=', '')

    ## Initialize TPU
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_name)  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    tpu_strategy = tf.distribute.TPUStrategy(tpu)

def create_network():
    strategy = tf.distribute.TPUStrategy(tpu)
    with strategy.scope():
        ## Create random data
        x_train = np.random.randn(1024, 128, 128, 1).astype('float32')  # astype necessary to help prevent Connect to Socket Error
        y_train = np.random.randn(1024, 50).astype('float32')
        x_test = np.random.randn(256, 128, 128, 1).astype('float32')
        y_test = np.random.randn(256, 50).astype('float32')

        model = Sequential()
        model.add(InputLayer((h, w, channels)))
        layers = 5
        ks = [np.random.choice([3, 5, 7]) for l in range(layers)]
        filters = [np.random.choice([64, 128, 256]) for l in range(layers)]
        for l in range(layers):
            model.add(
                Conv2D(kernel_size=(ks[l], ks[l]), padding='same',
                       filters=filters[l], name='conv' + str(l), activation='relu'))
        model.add(Flatten())
        # Softmax output layer
        model.add(Dense(50))  # Don't need softmax activation because from_logits performs that operation automatically
        lr = 0.001
        opt = Adam(learning_rate=lr, decay=1e-6)
        model.compile(optimizer=opt, loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True), metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=epochs, batch_size=mini_batch_size, shuffle=True, verbose=1)

        ##### memory leak also occurs with dataset API:
        '''
        train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(mini_batch_size,
                                                                                     drop_remainder=True)
        model.fit(train_dataset, epochs=epochs, verbose=1, shuffle=shuffle,
                  steps_per_epoch=len(x_train) // mini_batch_size)
        '''
        #######

        y_pred = model(x_test)

        ## Attempt to clear memory
        print(gc.collect())
        del model
        tf.keras.backend.clear_session()

while True:
    create_network()
Thank you so much! Please let me know if I should include any other information.
A few things:
Your error message:
(4) Resource exhausted: {{function_node __inference_train_function_37854}} Attempting to reserve 3.27G at the bottom of memory. That was not possible. There are 3.48G free, 0B reserved, and 1.67G reservable.
indicates an HBM OOM rather than a host memory OOM. Basically, the TPU has its own set of memory on the chips; in this case you've exhausted that memory. If it were a host RAM OOM, you would likely see the SocketClosed error, which you saw as well.
That being said, what are your options? I suggest you go with the tf.data approach but with a few modifications:
def get_dataset(is_training: bool):
    def generate_data(_):
        return tf.random.normal([128, 128, 1], dtype=tf.bfloat16)

    dataset = tf.data.Dataset.range(1)
    dataset = dataset.repeat()
    dataset = dataset.map(generate_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.repeat().batch(mini_batch_size, drop_remainder=is_training)
    return dataset

train_dataset = get_dataset(is_training=True)
eval_dataset = get_dataset(is_training=False)
In this example we can use bfloat16 which reduces the memory footprint on HBM, but you may need to further reduce your minibatch size from 1024 to 512. Alternatively you can go up from v2-8 to v3-8 which has 2x the HBM. I'm not sure if the numpy based method contributes to the OOMs/SocketClosed errors you see, but I don't think this approach should run into that. Of course you'll eventually use real data, and in that case you should use tf.data for optimal performance. More info here.
IIUC, tf.keras.backend.clear_session() and gc.collect() will only clear memory on your host VM, not on the TPU server.
PS: You can also use the steps_per_execution flag to further improve the performance. Please see here for more info. Basically this prevents execution from continuously switching from CPU to TPU every step. If you set this to equal the number of training steps in an epoch, this will give you the best throughput.
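A sketch of that flag on the question's compile call (my understanding is that steps_per_execution is the TF 2.4+ name, while TF 2.3, which the question uses, exposes it as experimental_steps_per_execution):
steps_per_epoch = 1024 // mini_batch_size  # number of training steps in one epoch of the MVE

model.compile(
    optimizer=opt,
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
    experimental_steps_per_execution=steps_per_epoch,  # `steps_per_execution` in TF 2.4+
)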

Smaller speedup than expected by precomputing encoded output in full pairwise comparison

I am building a neural net to predict the outcome of pairwise comparisons. The same encoder network is applied to both inputs before merging and computing the result in the downstream part. In my use case, I compute all the pairwise predictions for a given set of elements, so the number of predictions grows very quickly and I am interested in speeding up the prediction process.
Doing the complete set of pairwise predictions naively involves computing the result of the encoder network on each element over and over again. Since the encoder network is bigger than the downstream part (merging + downstream layers), I thought that precomputing the result of the encoder network on each input element and then only running the downstream part on these encoded values would lead to a significant speed-up. However, that is not really what I find in practice. For the example below, both on Colab (CPU) and on my machine (CPU), I get savings of only 10-15% in runtime, when I would have expected something like 50% if you think in terms of layers, and even more if you think in terms of parameters.
I feel like I am missing something, either in the implementation, or perhaps TensorFlow/Keras already does some kind of magic (caching?) given the structure of the network, leading to smaller gains?
import numpy as np # numpy will be used for mgrid to compute all the pairs of the input
import tensorflow as tf
# Encoder Network
input_a = tf.keras.Input(shape=(10,4))
x = tf.keras.layers.Flatten()(input_a)
x = tf.keras.layers.Dense(100, activation='relu')(x)
x = tf.keras.layers.Dense(20, activation='relu')(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
upstream_network = tf.keras.Model(input_a, x)
# Downstream network, from merge to final prediction
input_downstream_a = tf.keras.Input(shape = upstream_network.layers[-1].output_shape[1:])
input_downstream_b = tf.keras.Input(shape = upstream_network.layers[-1].output_shape[1:])
x = tf.keras.layers.subtract([input_downstream_a, input_downstream_b])
x = tf.keras.layers.Dense(20, activation='relu')(x)
x = tf.keras.layers.Dense(1, activation='sigmoid')(x)
downstream_network = tf.keras.Model((input_downstream_a, input_downstream_b), x)
# Full network
input_full_a = tf.keras.Input(shape=(10,4))
input_full_b = tf.keras.Input(shape=(10,4))
intermed_a = upstream_network(input_full_a)
intermed_b = upstream_network(input_full_b)
res = downstream_network([intermed_a, intermed_b])
full_network = tf.keras.Model([input_full_a, input_full_b], res)
full_network.compile(loss='binary_crossentropy')
# Experiment
population = np.random.random((300, 10, 4))
# %%timeit 10
# 1.9s on Colab CPU
indices = np.mgrid[range(population.shape[0]), range(population.shape[0])].reshape(2, -1)
full_network.predict([population[indices[0]], population[indices[1]]])
# %%timeit 10
# 1.7s on Colab CPU
out = upstream_network.predict(population)
indices = np.mgrid[range(population.shape[0]), range(population.shape[0])].reshape(2, -1)
downstream_network.predict([out[indices[0]], out[indices[1]]])
First, you are not going to be able to measure meaningful timing differences with such a small input population; you may try a bigger input size (600, 700, 800), but even then the prediction time is not going to increase a lot.
In your case I suggest using predict_on_batch rather than predict, as it does not split your input into n batches, which is a time-consuming task; predict_on_batch is reasonable as long as your data fits into (Google Colab's) memory:
full_network.predict_on_batch([population[indices[0]], population[indices[1]]])
Using your test case (300, 10, 4) - predict_on_batch
array([[0.5 ],
[0.5022318 ],
[0.47754446],
...,
[0.50507313],
[0.4884554 ],
[0.5 ]], dtype=float32)
time: 216 ms
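For completeness, a sketch combining the question's precompute idea with predict_on_batch, reusing the variable names from the question's code:
# Encode each element once, then run only the small downstream network on all pairs.
out = upstream_network.predict_on_batch(population)
indices = np.mgrid[range(population.shape[0]), range(population.shape[0])].reshape(2, -1)
pairwise_scores = downstream_network.predict_on_batch([out[indices[0]], out[indices[1]]])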

Adding more layers to tensorflow MNIST tutorial makes accuracy drop and sometimes accuracy remains constant over iteration for batch

I was following this deep learning tutorial, in which a simple neural network with one hidden layer is built. I did the same and it worked fine (accuracy 94%). Now I have added one more layer and the accuracy dropped to about 10%, and I don't know why.
Below is my code
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

sess = tf.InteractiveSession()
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

input_images = tf.placeholder(tf.float32, shape=[None, 784])
target_labels = tf.placeholder(tf.float32, shape=[None, 10])

hidden_nodes1 = 512
hidden_nodes2 = 256

input_weights = tf.Variable(tf.truncated_normal([784, hidden_nodes1]))
input_biases = tf.Variable(tf.zeros([hidden_nodes1]))
hidden_weights1 = tf.Variable(tf.truncated_normal([hidden_nodes1, hidden_nodes2]))
hidden_biases1 = tf.Variable(tf.zeros([hidden_nodes2]))
hidden_weights2 = tf.Variable(tf.truncated_normal([hidden_nodes2, 10]))
hidden_biases2 = tf.Variable(tf.zeros([10]))

input_layer = tf.matmul(input_images, input_weights)
hidden_layer1 = tf.nn.relu(input_layer + input_biases)
hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, hidden_weights1) + hidden_biases1)
digits_weights = tf.matmul(hidden_layer2, hidden_weights2) + hidden_biases2

loss_function = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=digits_weights, labels=target_labels))
optimizer = tf.train.GradientDescentOptimizer(0.2).minimize(loss_function)

correct_prediction = tf.equal(tf.argmax(digits_weights, 1), tf.argmax(target_labels, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

tf.global_variables_initializer().run()

for x in range(2000):
    batch = mnist.train.next_batch(100)
    optimizer.run(feed_dict={input_images: batch[0], target_labels: batch[1]})
    if (x + 1) % 100 == 0:
        print("Training Epoch " + str(x + 1))
        print("Accuracy: " + str(accuracy.eval(
            feed_dict={input_images: mnist.test.images, target_labels: mnist.test.labels})))
Your code is actually fine, but by adding a new hidden layer with 256 nodes you are dramatically increasing the number of learnable parameters; your model architecture has essentially become too big for these settings. Here is what I propose: reduce the number of nodes from 512 and 256 to something like 128, or at most 256, for both layers. Then use a much lower learning rate, as the current one is too high and may not converge on a minimum properly (or may even diverge); I'd change it to something like 0.01 or even lower. Another thing you could try is to use the AdamOptimizer instead of the GradientDescentOptimizer. Try these and the code should work fine!
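A minimal sketch of those suggestions applied to the question's graph (the node counts and learning rate follow the advice above and are starting points, not tuned values; the placeholders and training loop stay exactly as in the question):
hidden_nodes1 = 256
hidden_nodes2 = 128

input_weights = tf.Variable(tf.truncated_normal([784, hidden_nodes1]))
input_biases = tf.Variable(tf.zeros([hidden_nodes1]))
hidden_weights1 = tf.Variable(tf.truncated_normal([hidden_nodes1, hidden_nodes2]))
hidden_biases1 = tf.Variable(tf.zeros([hidden_nodes2]))
hidden_weights2 = tf.Variable(tf.truncated_normal([hidden_nodes2, 10]))
hidden_biases2 = tf.Variable(tf.zeros([10]))

hidden_layer1 = tf.nn.relu(tf.matmul(input_images, input_weights) + input_biases)
hidden_layer2 = tf.nn.relu(tf.matmul(hidden_layer1, hidden_weights1) + hidden_biases1)
digits_weights = tf.matmul(hidden_layer2, hidden_weights2) + hidden_biases2

loss_function = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=digits_weights, labels=target_labels))
# Adam with a much lower learning rate than the original 0.2
optimizer = tf.train.AdamOptimizer(0.001).minimize(loss_function)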