How do I distribute datasets between multiple GPUs in Tensorflow 2? - tensorflow

I'm trying to understand how to use multiple GPUs to train a model on data too large for GPU memory. Using tf.distribute.MirroredStrategy seems to copy the full dataset to each GPU. What I'm hoping to do is send a subset of the full dataset to each GPU (2 or 4 GPUs) and use MirroredStrategy to reconcile parameter updates on each epoch.
MirroredStrategy.distribute_datasets_from_function() looks promising.
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#distribute_datasets_from_function
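For reference, a minimal sketch of what that API's usage looks like, based on those docs (the shard() call is the docs' per-input-pipeline split; X_dict and y_dict are my dictionaries of numpy arrays, and the batch size is an assumption):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
global_batch_size = 1024

def dataset_fn(input_context):
    # Build the dataset from the numpy dictionaries, then keep only this
    # input pipeline's shard so no replica holds the full data.
    per_replica_batch = input_context.get_per_replica_batch_size(global_batch_size)
    ds = tf.data.Dataset.from_tensor_slices((X_dict, y_dict))
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    return ds.batch(per_replica_batch)

dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)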
Problem details:
A fairly complicated multimodal NN with ~200k parameters, synthesizing many text, transactional, and structured inputs, with multiple regression and probabilistic outputs. I'm looking at moving development from a single GPU with 24 GB of memory to cloud compute with multiple 16 GB cards on a single node.
The inputs and targets are currently dictionaries of numpy arrays. I'm hoping for a toy example that converts those dictionaries into a distributed dataset, through to training with a different subset of the full dataset assigned to each GPU.
I attempted this:
def build_model(**model_params):
    '''
    Builds a model from model_params
    '''
    return tf.keras.Model(
        inputs=[MY_INPUT_TENSORS],
        outputs=[MY_OUTPUT_TENSORS])

distributed_strategy = tf.distribute.MirroredStrategy()
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(...)

train_model.fit(X_dict, y_dict)
This runs on a 50% sample of the data, but hits OOM on the full sample on 2 GPUs. The full dataset appears to be copied to each of the two 16 GB GPUs available. The same model runs with a 100% sample on a single 24 GB GPU.

Here's how I got it working with tf.data.Dataset.from_tensor_slices() and tf.distribute.MirroredStrategy.experimental_distribute_dataset():
#Data exists in the form of dictionaries of large numpy arrays
x_train, y_train, x_validation, y_validation = {}, {}, {}, {}

#Create tensorflow datasets using CPU / system memory
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    valid = tf.data.Dataset.from_tensor_slices((x_validation, y_validation))

batch_size = 1024
epochs = 30

distributed_strategy = tf.distribute.MirroredStrategy()
num_gpu = distributed_strategy.num_replicas_in_sync

#Create a distributed dataset from the tensorflow datasets.
#The data gets streamed to the GPUs, so shuffling, repetition / epoch, and
#batch size need to be manually specified
train = train.shuffle(100 * batch_size).repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
train_dist = distributed_strategy.experimental_distribute_dataset(train)

valid = valid.repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)

#Build and compile the model
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=losses,
        loss_weights=weights)

#Train the model. steps_per_epoch and validation_steps need to be specified:
#since the datasets were repeated epochs times before batching, len() counts
#batches across all epochs, so dividing by epochs gives batches per epoch.
train_model.fit(
    train_dist,
    validation_data=valid,
    epochs=epochs,
    steps_per_epoch=int(len(train) // epochs),
    validation_steps=int(len(valid) // epochs),
    use_multiprocessing=True,
    verbose=1,
)

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model, and the features come in shape (2000, 3000, 1), where 2000 is the total number of samples, each a (3000, 1) 1D array. Batch size is 8, and 20% of the full dataset is used for validation.
However, zipping features and labels into a tf.data.Dataset gives completely different scores from feeding the numpy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
          batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
          batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method via the tf.data.Dataset API yields a mean absolute error (MAE) around 1e-3 on both the training and validation sets, which looks quite suspicious, as the model doesn't have any dropout or regularization to prevent overfitting. On the other hand, feeding numpy arrays in directly gives a training MAE around 0.1 and a validation MAE around 1.
The low MAE of the tf.data.Dataset method looks super suspicious, but I just couldn't figure out anything wrong with the code. I could also confirm that the number of training batches is 200 and validation batches is 50, meaning I didn't use the training set for validation.
I tried to vary the global random seed or use some different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried tensorflow version 2.9, 2.10, 2.11 which didn't make much difference.
The problem lies in the default behaviour of the shuffle method of tf.data.Dataset, more specifically its reshuffle_each_iteration argument, which defaults to True. That means if I implement the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
the dataset is actually reshuffled after each epoch, even though it might not look that way. As a result, validation data leaks into the training set (in fact, there is no distinction between the two sets, since the order is reshuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you would like to shuffle the dataset once and then do a train-validation split.
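Concretely, only the shuffle call in the pipeline above changes:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
# Shuffle ONCE, then split; the train/val split stays fixed across epochs.
dataset = dataset.shuffle(buffer_size=2000, reshuffle_each_iteration=False)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)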
UPDATE: TensorFlow has confirmed this issue, and a warning will be added to future docs.
PS: It was a hard lesson for me, as I had been using the model to analyse results for several months (as a graduating MPhil student).

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load into memory, but I have to preprocess it to feed it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't hold the entire conversion in memory, so I need to make batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size=1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:, :-1]
            decoder_target_seq = y_decoder[:, 1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only batches with the right number of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield ([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size=batch_size),
          steps_per_epoch=train_samples//batch_size,
          epochs=epochs,
          callbacks=callbacks,
          validation_data=generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size=batch_size),
          validation_steps=val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading a smaller amount of data into memory so that the one-hot conversion of the target words doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition nor any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf

def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data

...

X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)

model.fit(x=ds, ...)

Tensorflow2.x custom data generator with multiprocessing

I just upgraded to tensorflow 2.3.
I want to make my own data generator for training.
With tensorflow 1.x, I did this:
def get_data_generator(test_flag):
    item_list = load_item_list(test_flag)
    print('data loaded')
    while True:
        X = []
        Y = []
        for _ in range(BATCH_SIZE):
            x, y = get_random_augmented_sample(item_list)
            X.append(x)
            Y.append(y)
        yield np.asarray(X), np.asarray(Y)

data_generator_train = get_data_generator(False)
data_generator_test = get_data_generator(True)
model.fit_generator(data_generator_train, validation_data=data_generator_test,
                    epochs=10000, verbose=2,
                    use_multiprocessing=True,
                    workers=8,
                    validation_steps=100,
                    steps_per_epoch=500,
                    )
This code worked fine with tensorflow 1.x. 8 processes were created in the system. The processor and video card were loaded perfectly. "data loaded" was printed 8 times.
With tensorflow 2.3 i got warning:
WARNING: tensorflow: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
"data loaded" was printed once(should 8 times). GPU is not fully utilized. It also have memory leak every epoch, so traning will stops after several epochs. use_multiprocessing flag did not help.
How can I make a generator / iterator in tensorflow (keras) 2.x that can easily be parallelized across multiple CPU processes? Deadlocks and data order are not important.
With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading (see the interleave sketch after the example below). You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
In the code below, I have demonstrated how you can parallelize augmentation and add prefetching.
import numpy as np
import tensorflow as tf

x_shape = (32, 32, 3)
y_shape = ()  # A single item (not array).
classes = 10

# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        for i in range(n_samples):
            # Synthesize an image and a class label.
            x = np.random.random_sample(x_shape).astype(np.float32)
            y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
            yield x, y
    return generator

def augment(x, y):
    return x * tf.random.normal(shape=x_shape), y

samples = 10
batch_size = 5
epochs = 2

# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(np.float32, np.int32),
    output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
    augment,
    num_parallel_calls=AUTOTUNE,
    # Order does not matter.
    deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)

# Prepare model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)
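The example above parallelizes augmentation and prefetching but not reading. If your data live in files (TFRecord shards, for instance), reading can be parallelized with interleave. A minimal sketch, assuming a hypothetical data/shard-*.tfrecord file pattern:
import tensorflow as tf

# List the shard files, then read several of them concurrently.
files = tf.data.Dataset.list_files("data/shard-*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=False  # order does not matter
)
From there, the map / batch / prefetch steps are the same as above.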

Keras/Tensorflow - Generate predictions in batch for imagenet (I get only one result back)

I am generating imagenet tags for all keyframes in a video with a single call and have this code:
# all keras/tf/mobilenet imports
model_imagenet = MobileNetV2(weights='imagenet')

frames_list = []
for frame in frame_set:
    frame_img = frame.to_image()
    frame_pil = frame_img.resize((224, 224), Image.ANTIALIAS)
    ts = int(frame.pts)
    x = image.img_to_array(frame_pil)
    x = np.expand_dims(x, axis=0)
    x = preprocess_input(x)
    frames_list.append(x)

print(len(frames_list))
preds_list = model_imagenet.predict_on_batch(frames_list)
print("[*]", preds_list)
The result appears thus:
frames_list count: 125
and the predictions come back as a single row of 1000 dimensions (the imagenet classes); shouldn't it be 125 rows?
[[1.15425530e-04 1.83317825e-04 4.28701424e-05 2.87547664e-05
:
7.91769926e-05 1.30803732e-04 4.81895368e-05 3.06891889e-04]]
This is generating a prediction for a single row of the batch. I have tried both predict and predict_on_batch with the same result.
How can I get a bulk prediction for, say, 200 frames in one go with Keras/Tensorflow/Mobilenet?
ImageNet is a popular database which consists of 1000 different categories.
The dimension of 1000 is natural and to be expected, since for one image the softmax outputs a probability for each of the 1000 classes.
EDIT: For multiple image predictions, you should use predict_generator(). In addition, as of TensorFlow 2.0, if you use the Keras backend, predict_generator() has been deprecated in favor of simple predict, which also allows input data as generators.
E.g. : (from How to use predict_generator with ImageDataGenerator?) :
test_datagen = ImageDataGenerator(rescale=1./255)

#Modify the batch size here
test_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(200, 200),
    color_mode="rgb",
    shuffle=False,
    class_mode='categorical',
    batch_size=1)

filenames = test_generator.filenames
nb_samples = len(filenames)

predict = model.predict_generator(test_generator, steps=nb_samples)
Please bear in mind that you are unlikely to be able to run very many predictions at once, since the batch size is constrained by the memory of the video card.
Also, note the difference between predict and predict_on_batch: What is the difference between the predict and predict_on_batch methods of a Keras model?
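To make the distinction concrete (a sketch; big_array stands in for any stacked input):
# predict() splits its input into batches internally; memory use is
# bounded by batch_size.
preds = model.predict(big_array, batch_size=32)
# predict_on_batch() treats the whole input as ONE batch, so it must
# fit on the GPU at once.
preds = model.predict_on_batch(big_array)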
OK, here is how I solved it, hope this helps someone else:
preds_list = model_imagenet.predict(np.vstack(frames_list),batch_size=32)
print("[*]",preds_list)
Please note the np.vstack: each preprocessed frame has shape (1, 224, 224, 3) after np.expand_dims, so stacking them yields a single (125, 224, 224, 3) batch. Adjust the batch_size to whatever your hardware is capable of.

Keras fit_generator using a lot of memory even with small batch sizes

Previously I manually trained my model using model.fit() inside a for loop, training on small batches of data due to memory constraints. The problem with this is that I can't access all the previous histories through history.history, because it's as if a new model is trained each time, and previous histories aren't stored anywhere.
When I use model.fit() with a batch size of 500, around 7 GB of my RAM gets full. I use keras with the tensorflow-cpu backend.
But when I use a generator, even a batch size of 50 won't fit in memory, and gets swapped onto disk.
I'm performing classification, using 224*224 images, and I am trying to fine tune vgg face. I'm using vgg face implemented according to this link:
VGG-Face
I'm using ResNet and SeNet architectures, as described in the link.
I've previously shuffled my data, and I've put aside 20% of it for testing.
My data (image addresses and labels) are stored in a list. 20% of my training data will be used for validation. For example, if batch size equals 50, train_data_generator will create batches of size 40 from the first 80% of the training data, and val_data_generator will create batches of size 10 from the last 20%. I've written a class, and by creating an instance and invoking its train method, I perform training. Here are the generator and training parts of my code, excluding model definitions:
def prepare_input_data(self, batch_addresses):
    image = []
    for j in range(len(batch_addresses)):
        img = cv2.imread(batch_addresses[j])
        img = cv2.resize(img, (224, 224))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img - np.array([103.939, 116.779, 123.68])
        image.append(img)
    data = np.array(image)
    data = data.astype('float32')
    data /= 255
    return data

def train_data_generator(self, addresses, labels, batch_size):
    """Train data generator"""
    #Use the first 80% of the data for training.
    addresses = addresses[: int(0.8 * len(addresses))]
    labels = labels[: int(0.8 * len(labels))]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels
def val_data_generator(self, addresses, labels, batch_size):
    """Validation data generator"""
    #Use the last 20% of the data for validation.
    addresses = addresses[int(0.8 * len(addresses)):]
    labels = labels[int(0.8 * len(labels)):]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels
def train(self, label_interested_in):
    """Trains the model"""
    #Read training data from json file, and get addresses and labels
    addresses, labels = self.create_address_and_label(label_interested_in)
    batch_size = 50
    train_batch_size = 40
    val_batch_size = 10
    steps = int(len(addresses) / batch_size) + 1
    print(len(addresses), steps)
    #Perform training
    history = self.custom_vgg_model.fit_generator(
        self.train_data_generator(addresses, labels, train_batch_size),
        steps_per_epoch=steps, epochs=self.number_of_epochs,
        verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
        validation_steps=steps, initial_epoch=0)
Why am I seeing such high memory usage? Is it because of the way generators work in keras? I read that generators prepare batches beforehand to speed up training by running in parallel with it. Or am I doing something wrong?
As a side question, since there isn't a batch_size argument in fit_generator(), am I correct in assuming that data gets loaded into the model based on the generators, and that gradient updates are performed after each training and validation batch is loaded?
Try workers=0.
This will not invoke any multiprocessing. Normally, fit_generator uses k workers to fill a queue of batches ahead of time, up to the max_queue_size argument: the point is to prepare data on the CPU while training is ongoing on the GPU, so no time is lost and bottlenecks are avoided.
For your needs, workers=0 will work.
For deeper inquiry, refer to keras fit_generator.
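Applied to the train method above, that's the same fit_generator call with one extra argument (a sketch; workers is a standard fit_generator parameter):
history = self.custom_vgg_model.fit_generator(
    self.train_data_generator(addresses, labels, train_batch_size),
    steps_per_epoch=steps, epochs=self.number_of_epochs,
    verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
    validation_steps=steps, initial_epoch=0,
    workers=0)  # run the generators on the main thread; no prefetch queue is filled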