TensorFlow version: 2.1.0
Model built using tf.keras
Graphics card: Nvidia GTX 1660 Ti, 6 GB GDDR6
CPU: Intel i7, 9th Gen
RAM: 16 GB DDR4
Storage: NVMe SSD
I wrote code to read audio files in batches in a multithreaded manner using tf.keras.utils.Sequence with multiple workers, but the issue is that the CPU does not read the next set of audio batches concurrently while the GPU is training, so the GPU is only utilized up to about 30 percent of its capacity (training an epoch takes around 25 minutes).
So I decided to move to tf.data.Dataset.from_generator and reuse the existing generator function to read the batches more efficiently. But that input pipeline performs even worse, taking 47 minutes to train an epoch. I have attached the code I used to create the input pipeline. I read the file names and their categories from an Excel file, fed them to the generator, and created the pipeline from that.
Even after applying prefetch, the pipeline still performed poorly.
Since this is the first time I am using the tf.data API, I would appreciate some insight into whether I have made any mistakes.
This is my code to generate the batches.
# Function to read the audio files
def get_x(file):
    data = []
    for i in file:
        audio, fs = sf.read(i, dtype="float32")
        data.append(audio[::2])
    data = np.array(data, dtype=np.float32)
    data = np.expand_dims(data, axis=-1)
    return data
def data_generator(files, labels, batchsize):
    while True:
        start = 0
        end = batchsize
        while start < len(files):
            x = get_x(files[start:end])
            y = np.array(tf.keras.utils.to_categorical(labels[start:end], num_classes=2), dtype=np.float32)
            yield x, y
            start += batchsize
            end += batchsize
# Get the tensorflow dataset object to generate batches
def tf_data_dataset(files, labels, batch_size):
    autotune = tf.data.experimental.AUTOTUNE
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_types=(np.float32, np.float32),
        output_shapes=(tf.TensorShape([None, 16000, 1]),
                       tf.TensorShape([None, 2])),
        args=(files, labels, batch_size))
    dataset = dataset.prefetch(buffer_size=autotune)
    return dataset
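For reference, a map-based variant of the same pipeline, where each file is read inside a parallel map call instead of inside the generator, would look roughly like the sketch below. This is untested on my side: it wraps sf.read in tf.py_function and assumes every clip is 16000 samples long after the [::2] downsampling.

import numpy as np
import soundfile as sf
import tensorflow as tf

autotune = tf.data.experimental.AUTOTUNE

def read_audio(path):
    # Runs eagerly inside tf.py_function; path arrives as a string tensor.
    audio, _ = sf.read(path.numpy().decode(), dtype="float32")
    audio = audio[::2]                      # same downsampling as get_x
    return np.expand_dims(audio, axis=-1)   # shape (16000, 1)

def load(path, label):
    x = tf.py_function(read_audio, [path], tf.float32)
    x.set_shape([16000, 1])
    y = tf.one_hot(label, depth=2)          # same as to_categorical(num_classes=2)
    return x, y

def tf_data_dataset_parallel(files, labels, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((files, labels))
    dataset = dataset.map(load, num_parallel_calls=autotune)  # read files in parallel
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(autotune)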
I'm trying to understand how to use multiple GPUs to train a model on data too large for GPU memory. Using tf.distribute.MirroredStrategy seems to copy the full dataset to each GPU. What I'm hoping to do is send a subset of the full dataset to each GPU (2 or 4 GPUs) and use MirroredStrategy to reconcile parameter updates on each epoch.
MirroredStrategy.distribute_datasets_from_function() looks promising.
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#distribute_datasets_from_function
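From the docs, my rough understanding is that each replica is handed its own shard of the data, so the full arrays are never copied to every GPU. An untested sketch of what that might look like (the dummy x_train/y_train dictionaries below stand in for the real inputs):

import numpy as np
import tensorflow as tf

# Dummy stand-ins for the dictionaries of numpy arrays.
x_train = {"feature": np.random.rand(4096, 8).astype(np.float32)}
y_train = {"target": np.random.rand(4096, 1).astype(np.float32)}

strategy = tf.distribute.MirroredStrategy()
global_batch_size = 256

def dataset_fn(input_context):
    # Per-replica batch size and shard index come from the InputContext.
    batch_size = input_context.get_per_replica_batch_size(global_batch_size)
    ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    return ds.shuffle(10 * batch_size).batch(batch_size).prefetch(2)

dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)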
Problem details:
A fairly complicated multimodal NN with ~200k parameters, synthesizing many text, transactional, and structured inputs and producing multiple regression and probabilistic outputs. I'm looking at moving development from a single GPU with 24 GB of memory to cloud compute with multiple 16 GB cards on a single node.
The inputs and targets are currently dictionaries of numpy arrays. I'm hoping for a toy example that converts those dictionaries into a distributed dataset, through to training with a different subset of the full dataset assigned to each GPU.
I attempted this:
def build_model(**model_params):
    '''
    Builds a model from model_params
    '''
    return tf.keras.Model(
        inputs=[MY_INPUT_TENSORS],
        outputs=[MY_OUTPUT_TENSORS])

distributed_strategy = tf.distribute.MirroredStrategy()
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(...)

train_model.fit(X_dict, y_dict)
This runs on a 50% sample of the data, but returns OOM on the full sample on 2 GPUs. The full dataset appears to be copied to each of the two 16 GB GPUs available. The same model runs with a 100% sample on a single 24 GB GPU.
Here's how I got it working with tf.data.Dataset.from_tensor_slices() and tf.distribute.MirroredStrategy.experimental_distribute_dataset():
# Data exists in the form of dictionaries of large numpy arrays
x_train, y_train, x_validation, y_validation = {}, {}, {}, {}

# Create TensorFlow datasets using CPU / system memory
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    valid = tf.data.Dataset.from_tensor_slices((x_validation, y_validation))

batch_size = 1024
epochs = 30

distributed_strategy = tf.distribute.MirroredStrategy()
num_gpu = distributed_strategy.num_replicas_in_sync

# Create a distributed dataset from the TensorFlow datasets.
# The data gets streamed to the GPUs, so shuffling, repetition per epoch,
# and batch size need to be specified manually.
train = train.shuffle(100 * batch_size).repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
train_dist = distributed_strategy.experimental_distribute_dataset(train)
valid = valid.repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)

# Build and compile the model
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=losses,
        loss_weights=weights)

# Train the model. steps_per_epoch and validation_steps need to be specified.
train_model.fit(
    train_dist,
    validation_data=valid,
    epochs=epochs,
    steps_per_epoch=int(len(train) // epochs),
    validation_steps=int(len(valid) // epochs),
    use_multiprocessing=True,
    verbose=1,
)
I just upgraded to TensorFlow 2.3.
I want to make my own data generator for training.
With TensorFlow 1.x, I did this:
def get_data_generator(test_flag):
    item_list = load_item_list(test_flag)
    print('data loaded')
    while True:
        X = []
        Y = []
        for _ in range(BATCH_SIZE):
            x, y = get_random_augmented_sample(item_list)
            X.append(x)
            Y.append(y)
        yield np.asarray(X), np.asarray(Y)

data_generator_train = get_data_generator(False)
data_generator_test = get_data_generator(True)
model.fit_generator(data_generator_train, validation_data=data_generator_test,
                    epochs=10000, verbose=2,
                    use_multiprocessing=True,
                    workers=8,
                    validation_steps=100,
                    steps_per_epoch=500,
                    )
This code worked fine with TensorFlow 1.x: eight processes were created in the system, the CPU and GPU were fully utilized, and "data loaded" was printed 8 times.
With TensorFlow 2.3 I get this warning:
WARNING: tensorflow: multiprocessing can interact badly with TensorFlow, causing nondeterministic deadlocks. For high performance data pipelines tf.data is recommended.
"data loaded" was printed only once (it should be printed 8 times). The GPU is not fully utilized. There is also a memory leak every epoch, so training stops after several epochs. The use_multiprocessing flag did not help.
How can I make a generator/iterator in TensorFlow (Keras) 2.x that can easily be parallelized across multiple CPU processes? Deadlocks and data order are not important.
With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading. You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
In the code below, I have demonstrated how you can parallelize augmentation and add prefetching.
import numpy as np
import tensorflow as tf

x_shape = (32, 32, 3)
y_shape = ()  # A single item (not an array).
classes = 10

# This is tf.data.experimental.AUTOTUNE in older TensorFlow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        for i in range(n_samples):
            # Synthesize an image and a class label.
            x = np.random.random_sample(x_shape).astype(np.float32)
            y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
            yield x, y
    return generator

def augment(x, y):
    return x * tf.random.normal(shape=x_shape), y

samples = 10
batch_size = 5
epochs = 2

# Create the dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_types=(np.float32, np.int32),
    output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
    augment,
    num_parallel_calls=AUTOTUNE,
    # Order does not matter.
    deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)

# Prepare the model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)
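Reading can be parallelized in a similar way when the samples live in files rather than being synthesized. For example, with many TFRecord shards, tf.data.Dataset.interleave can read several files concurrently (a sketch; the file pattern and parse_fn are placeholders):

file_paths = tf.data.Dataset.list_files("data/shard-*.tfrecord")
dataset = file_paths.interleave(
    tf.data.TFRecordDataset,    # one reader per file
    cycle_length=4,             # read 4 files concurrently
    num_parallel_calls=AUTOTUNE,
    deterministic=False,
)
# Then parse, augment, batch, and prefetch as above, e.g.:
# dataset = dataset.map(parse_fn, num_parallel_calls=AUTOTUNE).batch(batch_size).prefetch(AUTOTUNE)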
I am using TensorFlow 1.9 to train a model on an image dataset that is too big to load from my hard drive into RAM. Therefore, I have split the dataset into two halves on my hard drive. I want to know the most efficient way to train on the entire dataset.
My GPU has 3 GB of memory and my machine has 32 GB of RAM. Each half of the dataset is 20 GB. My hard drive has plenty of free space (over 1 TB).
My attempt is as follows. I create an initializable tf.Dataset, and then on every epoch, I initialize it twice: once for each of the halves of the dataset. In this way, each epoch sees the entire dataset, but only has to have half of it loaded in RAM at any one time.
However, this is very slow, because it takes a long time to load the data from my hard drive, and also quite a long time to initialize the dataset with this data each time.
Is there a more efficient way to do this?
I have tried training on each half of the dataset for multiple epochs before loading the other half of the dataset, which is much faster, but this gives much worse performance on the validation data. Presumably, this is because the model is overfitting on each half and then not generalising to the data in the other half.
In my code below, I create and save some test data, which is then loaded as described above. The time to load each half dataset is about 5 seconds, and the time to initialize the dataset with this data is about 1 second. This may only seem like small amounts, but it all adds up over multiple epochs. In fact, my computer spends almost as much time loading the data as it does actually training on the data.
import tensorflow as tf
import numpy as np
import time

# Create and save 2 datasets of test NumPy data
dataset_num_elements = 100000
element_dim = 10000
batch_size = 50
test_data = np.zeros([2, int(dataset_num_elements * 0.5), element_dim], dtype=np.float32)
np.savez('test_data_1.npz', x=test_data[0])
np.savez('test_data_2.npz', x=test_data[1])

# Create the TensorFlow dataset
data_placeholder = tf.placeholder(tf.float32, [int(dataset_num_elements * 0.5), element_dim])
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)
dataset = dataset.shuffle(buffer_size=dataset_num_elements)
dataset = dataset.repeat()
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
init_op = iterator.initializer

num_batches = int(dataset_num_elements / batch_size)

with tf.Session() as sess:
    while True:
        for dataset_section in range(2):
            # Load the data from the hard drive
            t1 = time.time()
            print('Loading')
            loaded_data = np.load('test_data_' + str(dataset_section + 1) + '.npz')
            x = loaded_data['x']
            print('Loaded')
            t2 = time.time()
            loading_time = t2 - t1
            print('Loading time = ' + str(loading_time))

            # Initialize the dataset with this loaded data
            t1 = time.time()
            sess.run(init_op, feed_dict={data_placeholder: x})
            t2 = time.time()
            initialization_time = t2 - t1
            print('Initialization time = ' + str(initialization_time))

            # Read the data in batches
            for i in range(num_batches):
                x = sess.run(next_element)
Feeding data through placeholders is not an efficient way to input data. You can input data like this:
Create a filename dataset containing all the input file names. You can shuffle and repeat the dataset at this stage.
Map this dataset to the actual data; the map function reads, decodes, and transforms each image. Use multiple threads for the map transformation.
Prefetch the data for training.
This is just one example (a minimal sketch follows the checklist below). You can design your own pipeline; just remember the following:
keep any feeding as lightweight as possible
use multiple threads to read and preprocess
prefetch data for training
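A minimal sketch of that shape of pipeline, in TF 1.x style to match the question (the file pattern, JPEG decode step, batch size, and thread count are placeholders):

import tensorflow as tf

def load_and_preprocess(path):
    # Read, decode, and transform one image.
    raw = tf.read_file(path)
    img = tf.image.decode_jpeg(raw, channels=3)
    img = tf.image.resize_images(img, [224, 224])
    return img / 255.0

filenames = tf.data.Dataset.list_files("images/*.jpg")   # all input file names
filenames = filenames.shuffle(1000).repeat()
dataset = filenames.map(load_and_preprocess, num_parallel_calls=4)  # multi-threaded read/decode
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)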
I found that, in this program, a TF Estimator paired with a Dataset uses an unjustifiably huge amount of memory (about 1 GB) and takes tens of minutes, even though the batch size is only 10 and the number of iterations is 100.
The code for model initialisation:
classifier = tf.estimator.LinearClassifier(
    feature_columns=construct_feature_columns(),
    n_classes=10,
    optimizer=my_optimizer,
    config=tf.estimator.RunConfig(keep_checkpoint_max=1)
)
Fit procedure invoked:
classifier.train(
    input_fn=training_input_fn,
    steps=steps_per_period
)
The program classifies a 10k-example MNIST dataset (~60 MB in memory) with logistic regression, while the same task takes only seconds and a modest amount of memory with sklearn's LogisticRegression. Could anyone advise on what the primary memory consumer is here, or how I can trace the memory usage?
UPD: I carried out another experiment comparing feeding data to the model with placeholders versus the Dataset class. I implemented a custom logistic regression (as opposed to the one from Estimator) and found that feeding data using a Dataset:
ds = Dataset.from_tensor_slices((train_x, train_y))
ds = ds.batch(10).repeat(num_epochs)
ds = ds.shuffle(buffer_size=10000)
x,y = ds.make_one_shot_iterator().get_next()
...
sess.run(my_optimiser)
results in at least 5x as much memory being consumed during training as with placeholders:
x = tf.placeholder(tf.float32, (None,784), 'pixels')
y = tf.placeholder(tf.float32, (None), 'targets')
...
sess.run(my_optimiser, feed_dict={x:train_x, y:train_y})
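For reference, a simple way to watch overall process memory while these runs execute is a process-level check (this assumes psutil is installed and only shows total resident memory, not which TF objects hold it):

import os
import psutil

process = psutil.Process(os.getpid())

def log_memory(tag):
    # Resident set size of the current process, in MB.
    print(tag, process.memory_info().rss / 1024 ** 2, 'MB')

log_memory('before training')
sess.run(my_optimiser, feed_dict={x: train_x, y: train_y})
log_memory('after training')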
Previously I trained my model manually by calling model.fit() inside a for loop on small batches of data, due to memory constraints. The problem with this is that I can't access all the previous histories through history.history, because it's as if a new model is trained each time and the previous histories aren't stored anywhere.
When I use model.fit() with a batch size of 500, around 7 GB of my RAM gets used. I use Keras with the TensorFlow CPU backend.
But when I use a generator, even a batch size of 50 won't fit in memory, and memory gets swapped onto the disk.
I'm performing classification using 224x224 images, and I am trying to fine-tune VGG Face. I'm using VGG Face implemented according to this link:
VGG-Face
I'm using ResNet and SeNet architectures, as described in the link.
I've previously shuffled my data and put aside 20% of it for testing.
My data (image paths and labels) is stored in lists. 20% of the training data will be used for validation. For example, if the batch size is 50, train_data_generator will create a batch of size 40 from the first 80% of the training data, and val_data_generator will create a batch of size 10 from the last 20% of the training data. I've written a class, and I perform training by creating an instance and invoking its train method. Here are the generator and training parts of my code, excluding the model definitions:
def prepare_input_data(self, batch_addresses):
    image = []
    for j in range(len(batch_addresses)):
        img = cv2.imread(batch_addresses[j])
        img = cv2.resize(img, (224, 224))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img - np.array([103.939, 116.779, 123.68])
        image.append(img)
    data = np.array(image)
    data = data.astype('float32')
    data /= 255
    return data

def train_data_generator(self, addresses, labels, batch_size):
    """Train data generator"""
    # Use the first 80% of the data for training.
    addresses = addresses[: int(0.8 * len(addresses))]
    labels = labels[: int(0.8 * len(labels))]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels

def val_data_generator(self, addresses, labels, batch_size):
    """Validation data generator"""
    # Use the last 20% of the data for validation.
    addresses = addresses[int(0.8 * len(addresses)):]
    labels = labels[int(0.8 * len(labels)):]
    total_data = len(addresses)
    while 1:
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels

def train(self, label_interested_in):
    """Trains the model"""
    # Read the training data from a json file, and get addresses and labels
    addresses, labels = self.create_address_and_label(label_interested_in)
    batch_size = 50
    train_batch_size = 40
    val_batch_size = 10
    steps = int(len(addresses) / batch_size) + 1
    print(len(addresses), steps)

    # Perform training
    history = self.custom_vgg_model.fit_generator(
        self.train_data_generator(addresses, labels, train_batch_size),
        steps_per_epoch=steps, epochs=self.number_of_epochs,
        verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
        validation_steps=steps, initial_epoch=0)
Why am I seeing such high memory usage? Is it because of the way generators work in Keras? I read that generators prepare batches in advance to speed up training by running in parallel with it. Or am I doing something wrong?
As a side question, since there isn't a batch_size argument in fit_generator(), am I correct in assuming that data is loaded into the model from the generators batch by batch, and that gradient updates are performed after each training and validation batch is loaded?
Try workers=0
This will not invoke any multiprocessing, which is intended to fill up the queue in advance, up to the max_queue_size argument, using k workers.
What that normally does is prepare a queue of generated data on the CPU while training is ongoing on the GPU, so that no time is lost and bottlenecks are avoided.
For your needs, workers=0 will work.
For a deeper look, refer to the Keras fit_generator documentation.
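For example, with the generators from the question above, the call would look roughly like this (a sketch; only the workers and max_queue_size arguments are the point here):

history = self.custom_vgg_model.fit_generator(
    self.train_data_generator(addresses, labels, train_batch_size),
    steps_per_epoch=steps,
    epochs=self.number_of_epochs,
    validation_data=self.val_data_generator(addresses, labels, val_batch_size),
    validation_steps=steps,
    workers=0,           # run the generator on the main thread; no background queue
    max_queue_size=10,   # only used when workers > 0
    verbose=1)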