Tensorflow2.x custom data generator with multiprocessing - tensorflow

I just upgraded to tensorflow 2.3.
I want to make my own data generator for training.
With tensorflow 1.x, I did this:
def get_data_generator(test_flag):
    item_list = load_item_list(test_flag)
    print('data loaded')
    while True:
        X = []
        Y = []
        for _ in range(BATCH_SIZE):
            x, y = get_random_augmented_sample(item_list)
            X.append(x)
            Y.append(y)
        yield np.asarray(X), np.asarray(Y)
data_generator_train = get_data_generator(False)
data_generator_test = get_data_generator(True)
model.fit_generator(data_generator_train, validation_data=data_generator_test,
                    epochs=10000, verbose=2,
                    use_multiprocessing=True,
                    workers=8,
                    validation_steps=100,
                    steps_per_epoch=500,
                    )
This code worked fine with TensorFlow 1.x. Eight processes were created, the CPU and GPU were fully loaded, and "data loaded" was printed 8 times.
With TensorFlow 2.3 I get this warning:
WARNING: tensorflow: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
"data loaded" is printed only once (it should be printed 8 times) and the GPU is not fully utilized. Memory also leaks every epoch, so training stops after several epochs. The use_multiprocessing flag did not help.
How can I make a generator / iterator in TensorFlow (Keras) 2.x that can easily be parallelized across multiple CPU processes? Deadlocks and data order are not important.

With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading. You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
In the code below, I have demonstrated how you can parallelize augmentation and add prefetching.
import numpy as np
import tensorflow as tf
x_shape = (32, 32, 3)
y_shape = () # A single item (not array).
classes = 10
# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE
def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        for i in range(n_samples):
            # Synthesize an image and a class label.
            x = np.random.random_sample(x_shape).astype(np.float32)
            y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
            yield x, y
    return generator
def augment(x, y):
    return x * tf.random.normal(shape=x_shape), y
samples = 10
batch_size = 5
epochs = 2
# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(np.float32, np.int32),
    output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
    augment,
    num_parallel_calls=AUTOTUNE,
    # Order does not matter.
    deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)
# Prepare model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)
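To make the prefetching idea concrete outside of tf.data, here is a minimal pure-Python sketch of a bounded prefetch buffer filled by a background thread. It only illustrates the concept and is not how tf.data implements prefetch; the function name prefetch, the sentinel object, and the toy batches are assumptions made for this example.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream

def prefetch(iterable, buffer_size=2):
    """Yield items from `iterable` while a background thread produces ahead."""
    q = queue.Queue(maxsize=buffer_size)  # bounded: the producer blocks when full

    def producer():
        for item in iterable:
            q.put(item)
        q.put(_SENTINEL)

    # Daemon thread so an abandoned consumer does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item

# Toy "batches" stand in for real training data.
batches = list(prefetch(([i, i + 1] for i in range(0, 6, 2)), buffer_size=2))
print(batches)  # [[0, 1], [2, 3], [4, 5]]
```

The consumer (the training step) only waits when the buffer is empty, which is the same effect dataset.prefetch aims for. This sketch does not forward exceptions raised by the producer, so a real version would need error handling.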

Related

How do I distribute datasets between multiple GPUs in Tensorflow 2?

I'm trying to understand how to use multiple gpus to train a model on data too large for the GPU memory. Using tf.distribute.MirroredStrategy seems to copy the full data set to each GPU. What I'm hoping to do is to send a subset of the full dataset to each GPU (2 or 4 gpus) and use MirroredStrategy to reconcile parameter updates on each epoch.
MirroredStrategy.distribute_datasets_from_function() looks promising.
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#distribute_datasets_from_function
Problem details:
A fairly complicated multimodal NN with ~200k parameters synthesizing many text, transactional, and structured inputs and with multiple regression and probabilistic outputs. I'm looking at moving development from a single GPU with 24gb memory to cloud compute with multiple 16gb cards on a single node.
The input and targets are currently dictionaries of numpy arrays. I'm hoping for a toy example converting those dictionaries into a distributed data set through to training with different subsets of the full data set assigned to each GPU.
I attempted this:
def build_model(**model_params):
    '''
    Builds a model from model_params
    '''
    return tf.keras.Model(
        inputs = [MY_INPUT_TENSORS],
        outputs = [MY_OUTPUT_TENSORS])
distributed_strategy = tf.distribute.MirroredStrategy()
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(...)
train_model.fit(X_dict, y_dict)
This runs on a 50% sample of the data, but returns OOM on the full sample on 2 GPUs. The full data set appears to be copied to each of the 2 16gb GPUs available. The same model runs with a 100% sample on a single 24gb GPU.
Here's how I got it working with tf.data.Dataset.from_tensor_slices() and tf.distribute.MirroredStrategy.experimental_distribute_dataset():
#Data exists in the form of dictionaries of large numpy arrays
x_train, y_train, x_validation, y_validation = {},{},{},{}
#Create tensorflow datasets using CPU / system memory
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    valid = tf.data.Dataset.from_tensor_slices((x_validation, y_validation))
batch_size = 1024
epochs = 30
distributed_strategy = tf.distribute.MirroredStrategy()
num_gpu = distributed_strategy.num_replicas_in_sync
#Create a distributed dataset from the tensorflow datasets.
#The data gets streamed to the GPUs, so shuffling, repetition / epoch, and batch
#size need to be manually specified
train = train.shuffle(100*batch_size).repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
train_dist = distributed_strategy.experimental_distribute_dataset(train)
valid = valid.repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
#Build and compile the model
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(
        optimizer = tf.keras.optimizers.Adam(learning_rate = learning_rate),
        loss = losses,
        loss_weights = weights )
#Train the model. steps_per_epoch and validation_steps need to be specified.
train_model.fit(
    train_dist,
    validation_data = valid,
    epochs = epochs,
    steps_per_epoch = int(len(train)//epochs),
    validation_steps = int(len(valid)//epochs),
    use_multiprocessing = True,
    verbose = 1,
)
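The steps_per_epoch expression above works because the dataset is repeated epochs times before batching, so len(train) counts the batches of all epochs together. A small worked sketch of that arithmetic follows; the sample count and batch size are hypothetical values chosen only for illustration, and the helper name is not part of any API.

```python
def steps_per_epoch(n_samples, per_replica_batch, n_replicas, epochs):
    """Mirror the arithmetic of repeat(epochs).batch(global_batch, drop_remainder=True)."""
    global_batch = per_replica_batch * n_replicas        # one batch is split across all replicas
    total_batches = (n_samples * epochs) // global_batch  # what len(train) would report
    return global_batch, total_batches // epochs

# Hypothetical sizes: 1,000,000 samples, 1024 per GPU, 2 GPUs, 30 epochs.
global_batch, steps = steps_per_epoch(
    n_samples=1_000_000, per_replica_batch=1024, n_replicas=2, epochs=30
)
print(global_batch, steps)  # 2048 488
```

Each replica then receives global_batch / n_replicas samples per step, which is why only a slice of every batch, rather than the whole data set, has to fit on each GPU.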

Training runs out of memory as RAM consumption keeps growing

I am not sure when this issue started (it must have appeared at some point in the last few months), but RAM (CPU) consumption appears to grow over time across epochs.
self.model.fit(
    train_data,
    initial_epoch=self.status.valid_last.epoch,
    epochs=train_config.epochs,
    steps_per_epoch=train_config.steps_per_epoch,
    callbacks=self._get_experiment_callbacks(),
    validation_data=valid_data,
    validation_steps=train_config.validation_steps,
)
The only thing out of the ordinary here might be the callbacks I am passing but there's actually nothing special here. One is a TensorBoard (TB) callback and the other is a custom Metric which is not doing much except plotting the learning rate and other general metrics to TB.
def _get_experiment_callbacks(self) -> List[tf.keras.callbacks.Callback]:
    tensorboard_cb = tf.keras.callbacks.TensorBoard(
        log_dir=os.path.join(out_dir, "logs"),
        update_freq="epoch",
        profile_batch=profile_batch,
        write_images=True,
    )
    # Not interested in whatever is plotted in those
    tensorboard_cb.on_epoch_end = lambda *args: ...
    tensorboard_cb.on_test_end = lambda *args: ...
    return [
        tensorboard_cb,
        Metrics(tensorboard_cb, update_freq=100),
    ]
This leaves us with the last suspect which is the valid_data itself. This is essentially just a list of protobuf files (shards) which I am loading like so:
def load_shards(
    decode_example_fn: Callable,
    shard_fps: List[str],
    training: bool,
    buffer_size: int = None  # 50 * 1000 ** 2,
) -> tf.data.Dataset:
    if not len(shard_fps) > 0:
        raise ValueError("Argument shard_fps must be a list to shards but is empty.")
    def make_dense_(example):
        for k, v in example.items():
            if isinstance(v, tf.SparseTensor):
                example[k] = tf.sparse.to_dense(v)
        return example
    def load_records_(filenames):
        record_dataset = tf.data.TFRecordDataset(filenames, buffer_size=buffer_size)
        record_dataset = record_dataset.map(decode_example_fn)
        record_dataset = record_dataset.map(make_dense_)
        return record_dataset
    if not training:
        shard_fps = sorted(shard_fps)
    dataset = tf.data.Dataset.from_tensor_slices(tf.constant(shard_fps))
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
    dataset = dataset.with_options(options)
    if training:
        dataset = dataset.interleave(load_records_, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    else:
        dataset = dataset.apply(load_records_)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset
From then on there are only preprocessing and transformation mappings on the inputs, so I would not expect any memory leak at this point.
Still, I am observing a continuous increase in memory consumption over time. The screenshot below shows the consumption after a restart.
At first we use ~28GB of RAM. After 100 steps there's a sharp increase to ~33GB, and from there it seems to stabilize at around 38GB. The next big jump, at 216k steps, comes from an evaluation. From there it just keeps growing.
It looks as if the memory usage stabilizes within an epoch and the jump only occurs after each epoch (1 epoch = 6000 steps).
Any number of things could be wrong. TensorBoard may not be reusing the same graph and may instead keep adding graphs, which leads to OOM. I don't use TensorBoard myself because I remember this happening to me a few years back. It's also possible that model.fit is the problem and that your data is being reloaded at every epoch. You could try writing the training loop yourself, something like:
for epoch in tf.range(epochs):
    batch_train_loss = []
    batch_train_acc = []
    for batch, (X, Y) in train_dataset.enumerate():
        train_loss = train_fn(X, Y, model, loss, optimizer, metric, batch)  # do the actual training
        train_acc = metric.result().numpy()  # get the training accuracy
        batch_train_loss.append(train_loss)  # save the training loss above
        batch_train_acc.append(train_acc)  # save the training accuracy above
        metric.reset_states()  # reset the metric after every batch
where the train_fn is:
def get_apply_train_fn():
    # @tf.function  # uncomment to compile the training step into a graph
    def train_function(X, Y, model, loss, optimizer, metric, step):
        with tf.GradientTape() as tape:
            predictions = model(X, training=True)
            loss_value = loss(Y, predictions)
        gradients = tape.gradient(loss_value, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))
        metric.update_state(Y, predictions)  # accumulate the training metric
        return loss_value
    return train_function
train_fn = get_apply_train_fn()
Now, this is a stupidly complicated way of writing model.fit, but it does work.
Another way I've had to combat OOM on the GPU side is to use Python's multiprocessing, but that was in a context where I was doing 10-fold cross-validation and the training would crash with OOM after 7 or 8 folds.
Alternatively, you could try turning eager execution on or off with
tf.config.run_functions_eagerly(False) # or True
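Before rewriting the training loop, it may help to confirm where memory grows. Below is a minimal sketch using Python's standard tracemalloc module to measure growth per epoch. It only sees allocations made through Python's allocator, so memory held by TensorFlow's native runtime will not show up. The run_epochs helper and the leaky_epoch stand-in are hypothetical and exist only for this illustration; in practice the callable would wrap one epoch of training.

```python
import tracemalloc

def run_epochs(n_epochs, train_one_epoch):
    """Run epochs and return the Python-level memory growth (bytes) of each one."""
    tracemalloc.start()
    previous = tracemalloc.take_snapshot()
    growth = []
    for epoch in range(n_epochs):
        train_one_epoch(epoch)
        current = tracemalloc.take_snapshot()
        # Sum the per-line size differences between consecutive snapshots.
        stats = current.compare_to(previous, "lineno")
        growth.append(sum(stat.size_diff for stat in stats))
        previous = current
    tracemalloc.stop()
    return growth

# Hypothetical leaky epoch: keeps 1 MB alive per call, like a growing cache.
leak = []
def leaky_epoch(epoch):
    leak.append(bytearray(1_000_000))

growth = run_epochs(3, leaky_epoch)
print([g > 900_000 for g in growth])  # [True, True, True]
```

If the per-epoch growth reported here stays near zero while the process RSS keeps climbing, the leak is more likely on the native side (for example in the input pipeline or a callback writing summaries) than in Python objects.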

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load into memory, but I have to preprocess it before feeding it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't hold the entire conversion in memory, so I need to generate the data in batches.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size = 1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:,:-1]
            decoder_target_seq = y_decoder[:,1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only with the right amount of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size = batch_size),
          steps_per_epoch = train_samples//batch_size,
          epochs=epochs,
          callbacks = callbacks,
          validation_data = generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size = batch_size),
          validation_steps = val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading less data into memory, so that the one-hot encoding of the target words doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition or any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf
def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data
...
X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)
model.fit(x = ds, ...)
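To see what map_fn computes for a single target sequence, here is a small pure-Python sketch of the same shift-and-encode step. It is only an illustration of the teacher-forcing split, not a replacement for tf.one_hot; the helper names and the tiny vocabulary size are assumptions made for the example.

```python
def one_hot(index, depth):
    """Return a one-hot list of length `depth` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def split_for_teacher_forcing(y, vocab_size):
    """Decoder input is y without its last token; the target is y shifted by one."""
    decoder_input = y[:-1]
    decoder_target = [one_hot(token, vocab_size) for token in y[1:]]
    return decoder_input, decoder_target

# Hypothetical 4-token sequence over a 4-word vocabulary.
dec_in, dec_tgt = split_for_teacher_forcing([1, 3, 2, 0], vocab_size=4)
print(dec_in)   # [1, 3, 2]
print(dec_tgt)  # [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```

Because the encoding happens per example inside the pipeline, only one batch of one-hot targets has to exist at a time, which is what avoids materializing the full conversion in memory.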

Using shared variables across sessions in tensorflow

I want to train a model and at the same time use the results of the model for further actions. The training can be done in the background, but I need the prediction model to be available all the time.
I've got an idea of how to do this, but I'm not sure whether it is possible in TensorFlow. I'm thinking of creating separate threads/processes for prediction and training. Each process would run its own session, and the two sessions would share the same variables. That way the training model can update the variables in its own time and the prediction model can use the latest weights for better predictions.
Is there any way to share variables across sessions, or some better way to do this? I've heard that it is discouraged to run multiple sessions in TensorFlow.
On the same machine, can you share one session between the "predict" and "train" threads? tf.Session().run() calls are thread safe. Here is a working example:
import tensorflow as tf
import numpy as np
import time
import threading
N = 128
input = tf.placeholder(tf.float32, shape=(None, N))
labels = tf.greater_equal(tf.reduce_sum(input, axis=-1, keepdims=True), 0)
l1size = 1024
fc1 = tf.contrib.layers.fully_connected(input, l1size)
l2size=128
fc2 = tf.contrib.layers.fully_connected(fc1, l2size)
predictions = tf.contrib.layers.fully_connected(fc2, 1,
                                                activation_fn=tf.nn.sigmoid)
loss = tf.losses.mean_squared_error(labels, predictions)
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)
session = tf.Session()
session.run(tf.global_variables_initializer())
keep_going = True
def predict_thread(session):
    test_data = np.random.randn(10, N)
    while keep_going:
        current_loss = session.run(loss, feed_dict={input:test_data})
        print("Current loss: %f" % current_loss)
        time.sleep(1.)
def train_thread(session):
    train_data = np.random.randn(1024, N)
    while keep_going:
        session.run(train_op, feed_dict={input:train_data})
t1 = threading.Thread(target=train_thread, args=(session,))
t2 = threading.Thread(target=predict_thread, args=(session,))
t1.start()
t2.start()
time.sleep(10)
keep_going = False
t1.join()
t2.join()
You can also save/restore your model from time to time if training and prediction are on different machines. This question might be related.
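The same train-while-predicting pattern can be sketched without TensorFlow, which may make the thread interaction easier to follow. In this toy version a lock stands in for the thread safety that the session provides, and a threading.Event replaces the global keep_going flag. The SharedModel class and its one-weight "model" are assumptions made purely for illustration.

```python
import threading

class SharedModel:
    """Toy one-weight model shared by a training and a prediction thread."""
    def __init__(self):
        self._lock = threading.Lock()
        self._weight = 0.0
        self.updates = 0

    def train_step(self, target=1.0, lr=0.1):
        with self._lock:  # updates are applied atomically
            self._weight += lr * (target - self._weight)
            self.updates += 1

    def predict(self, x):
        with self._lock:  # always reads the latest weight
            return self._weight * x

model = SharedModel()
stop = threading.Event()
predictions = []

def train_loop(n_steps):
    for _ in range(n_steps):
        model.train_step()

def predict_loop():
    while True:
        predictions.append(model.predict(1.0))
        if stop.wait(0.001):  # returns True once stop is set
            break

trainer = threading.Thread(target=train_loop, args=(100,))
predictor = threading.Thread(target=predict_loop)
trainer.start()
predictor.start()
trainer.join()
stop.set()
predictor.join()
final_prediction = model.predict(1.0)
```

The predictor sees whatever weight the trainer has reached at that moment, so its outputs improve over time without any explicit hand-off between the two threads.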

Multiprocessing with GPU in keras

I need to compute multiple deep models in parallel and average their results. My job runs forever after finishing computation with GPU 0.
def model_train(self, params):
    from nn_arch import nn_models
    X, y, gpu_no = params
    print("GPU NO ", gpu_no)
    with tf.device('/gpu:' + str(gpu_no)):
        model1 = nn_models.lenet5()
        early_callback = CustomCallback()
        model1.fit(X, y, batch_size=256, validation_split=0.2, callbacks=[early_callback],
                   verbose=1,
                   epochs=1)
    return model1
And my main method below. In this case I have 2 GPUs
def main(self, X_train, y_train, X_test, y_test):
    random_buckets = self.get_random()
    X = [X_train[random_buckets[k]] for k in sorted(random_buckets)]
    y = [y_train[random_buckets[j]] for j in sorted(random_buckets)]
    params = zip(X, y, [0, 1])
    models = pool1.map(self.model_train, params)
How do I train multiple models in parallel with Keras? (Data Parallel Approach)
Before compiling the model in Keras, add this line:
model = make_parallel(model, 2)
where 2 is the number of GPUs available.
The make_parallel function is available in this file. Just import the file in your code and your model will be executed on multiple GPUs:
https://github.com/kuza55/keras-extras/blob/master/utils/multi_gpu.py
make_parallel is a simple function that:
instantiates a copy of your model on each of the N GPUs you tell it to use
splits your batch into N evenly sized smaller batches
passes each smaller batch into the corresponding model copy
concatenates the outputs of the models
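The split, apply, and concatenate steps listed above can be sketched in plain Python. This is not the code of make_parallel from the linked file, only an illustration of the data flow; the function names are hypothetical, and the replicas are called one after another here, whereas on real GPUs they would run concurrently.

```python
def split_batch(batch, n_parts):
    """Split `batch` into `n_parts` evenly sized slices."""
    if len(batch) % n_parts != 0:
        raise ValueError("batch size must be divisible by the number of replicas")
    size = len(batch) // n_parts
    return [batch[i * size:(i + 1) * size] for i in range(n_parts)]

def run_data_parallel(model_fn, batch, n_replicas):
    """Apply `model_fn` to each slice and concatenate the outputs in order."""
    shards = split_batch(batch, n_replicas)
    outputs = [model_fn(shard) for shard in shards]  # one call per model copy
    return [value for output in outputs for value in output]

# Toy "model" that doubles every input value.
def double(shard):
    return [2 * v for v in shard]

result = run_data_parallel(double, [1, 2, 3, 4], n_replicas=2)
print(result)  # [2, 4, 6, 8]
```

Since every copy shares the same weights, the concatenated output matches what a single model would produce on the full batch, while each device only processes a slice of it.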
Please refer to multi-GPU TensorFlow tutorials as a reference.
https://github.com/tensorflow/tensorflow/blob/r0.7/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py