TensorFlow takes too long to load data into a tf.Dataset - tensorflow

I am using TensorFlow 1.9 to train an image dataset, which is too big to load from my hard drive into RAM. Therefore, I have split the dataset into two halves on my hard drive. I want to know what is the most efficient way to train on the entire dataset.
My GPU has 3 GB of memory, and my RAM has 32 GB of memory. The size of each half dataset is 20 GB. My hard drive has plenty of free space (over 1 TB).
My attempt is as follows. I create an initializable tf.Dataset, and then on every epoch, I initialize it twice: once for each of the halves of the dataset. In this way, each epoch sees the entire dataset, but only has to have half of it loaded in RAM at any one time.
However, this is very slow, because it takes a long time to load the data from my hard drive, and also quite a long time to initialize the dataset with this data each time.
Is there a more efficient way to do this?
I have tried training on each half of the dataset for multiple epochs before loading the other half of the dataset, which is much faster, but this gives much worse performance on the validation data. Presumably, this is because the model is overfitting on each half and then not generalising to the data in the other half.
In my code below, I create and save some test data, which is then loaded as described above. The time to load each half dataset is about 5 seconds, and the time to initialize the dataset with this data is about 1 second. This may only seem like small amounts, but it all adds up over multiple epochs. In fact, my computer spends almost as much time loading the data as it does actually training on the data.
import tensorflow as tf
import numpy as np
import time
# Create and save 2 datasets of test NumPy data
dataset_num_elements = 100000
element_dim = 10000
batch_size = 50
test_data = np.zeros([2, int(dataset_num_elements * 0.5), element_dim], dtype=np.float32)
np.savez('test_data_1.npz', x=test_data[0])
np.savez('test_data_2.npz', x=test_data[1])
# Create the TensorFlow dataset
data_placeholder = tf.placeholder(tf.float32, [int(dataset_num_elements * 0.5), element_dim])
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)
dataset = dataset.shuffle(buffer_size=dataset_num_elements)
dataset = dataset.repeat()
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
init_op = iterator.initializer
num_batches = int(dataset_num_elements / batch_size)
with tf.Session() as sess:
    while True:
        for dataset_section in range(2):
            # Load the data from the hard drive
            t1 = time.time()
            print('Loading')
            loaded_data = np.load('test_data_' + str(dataset_section + 1) + '.npz')
            x = loaded_data['x']
            print('Loaded')
            t2 = time.time()
            loading_time = t2 - t1
            print('Loading time = ' + str(loading_time))
            # Initialize the dataset with this loaded data
            t1 = time.time()
            sess.run(init_op, feed_dict={data_placeholder: x})
            t2 = time.time()
            initialization_time = t2 - t1
            print('Initialization time = ' + str(initialization_time))
            # Read the data in batches
            for i in range(num_batches):
                x = sess.run(next_element)

Feeding data through feed_dict is not an efficient way to input data. You can build the input pipeline like this instead:
Create a dataset of all the input file names. You can shuffle and repeat the dataset at this stage.
Map this dataset to the data itself: the map function reads, decodes, and transforms each image. Run the map with multiple threads.
Prefetch the data for training.
This is just one example. You could design your own pipeline; just remember the following:
Keep anything passed through feed_dict as lightweight as possible.
Use multiple threads to read and preprocess.
Prefetch data for training.
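The steps above can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: it assumes TensorFlow 2.4 or later (for `tf.data.AUTOTUNE` and eager iteration), and the file layout, `load_example`, `make_dataset`, and the sizes are hypothetical stand-ins. Each example is stored as a small raw float32 file here; for real images you would decode the image format inside `load_example` instead. The same structure applies on TensorFlow 1.x, though some names differ.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

ELEMENT_DIM = 8   # hypothetical number of float32 values per example
NUM_FILES = 20    # hypothetical number of example files
BATCH_SIZE = 4

# Write one small raw little-endian float32 file per example
# (a stand-in for the image files on disk).
data_dir = tempfile.mkdtemp()
filenames = []
for i in range(NUM_FILES):
    path = os.path.join(data_dir, "example_%03d.bin" % i)
    np.full([ELEMENT_DIM], i, dtype="<f4").tofile(path)
    filenames.append(path)


def load_example(path):
    """Read and decode one file; swap in image decoding as needed."""
    raw = tf.io.read_file(path)
    example = tf.io.decode_raw(raw, tf.float32)
    return tf.reshape(example, [ELEMENT_DIM])


def make_dataset(filenames, batch_size):
    # 1. Dataset of file names only: cheap to hold in memory and to shuffle.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    dataset = dataset.shuffle(buffer_size=len(filenames))
    # 2. Read and decode files in parallel.
    dataset = dataset.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.batch(batch_size)
    # 3. Prepare the next batch while the current one is being consumed.
    return dataset.prefetch(1)


dataset = make_dataset(filenames, BATCH_SIZE)
batches = [batch.numpy() for batch in dataset]  # one pass over the data
```

Because only file names are shuffled and held up front, the whole dataset never has to fit in RAM, and nothing large is passed through feed_dict.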

Related

How to read audio files using tf.data.Dataset.from_generator

Tensorflow Version : 2.1.0
Model built using tf.keras
Graphics Card : Nvidia GTX 1660TI 6GB DDR6
CPU : Intel i7 9th Gen
Ram : 16 GB DDR4
Storage Disk : SSD (NVME)
I wrote code to read audio files in batches in a multithreaded manner using tf.keras.utils.Sequence with multiple workers, but the CPU does not read the next set of audio batches concurrently while the GPU is training. As a result, the GPU is utilized at only up to 30 percent of its capacity (training time for an epoch is around 25 minutes).
So I decided to move to tf.data.Dataset.from_generator, reusing the existing generator function to read the batches more efficiently. But that input pipeline performs even worse (an epoch takes 47 minutes to train). I have attached the code that I used to create the input pipeline. I read the file names and their categories from an Excel file, feed them to the generator, and create the pipeline.
Even after applying prefetch, the pipeline still performed much worse.
Since this is the first time I am using the tf.data API, I would like some insight into whether I have made any mistakes.
This is my code to generate the batches.
import numpy as np
import soundfile as sf  # assumption: `sf` refers to the soundfile package
import tensorflow as tf

# Read a list of audio files into one batch array
def get_x(file):
    data = []
    for i in file:
        audio, fs = sf.read(i, dtype="float32")
        data.append(audio[::2])  # keep every second sample
    data = np.array(data, dtype=np.float32)
    data = np.expand_dims(data, axis=-1)
    return data

# Yield (x, y) batches forever, restarting from the first file after each pass
def data_generator(files, labels, batchsize):
    while True:
        start = 0
        end = batchsize
        while start < len(files):
            x = get_x(files[start:end])
            y = np.array(tf.keras.utils.to_categorical(labels[start:end], num_classes=2), dtype=np.float32)
            yield x, y
            start += batchsize
            end += batchsize

# Get the tensorflow data dataset object to generate batches
def tf_data_dataset(files, labels, batch_size):
    autotune = tf.data.experimental.AUTOTUNE
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_types=(np.float32, np.float32),
        output_shapes=(tf.TensorShape([None, 16000, 1]),
                       tf.TensorShape([None, 2])),
        args=(files, labels, batch_size))
    dataset = dataset.prefetch(buffer_size=autotune)
    return dataset

Tensorflow Estimator memory usage

I found that, in this program, a TF Estimator paired with a Dataset uses an unjustifiably large amount of memory (about 1 GB) and takes tens of minutes, although the batch size is only 10 and the number of iterations is 100.
The code for model initialisation:
classifier = tf.estimator.LinearClassifier(
    feature_columns=construct_feature_columns(),
    n_classes=10,
    optimizer=my_optimizer,
    config=tf.estimator.RunConfig(keep_checkpoint_max=1)
)
Fit procedure invoked:
classifier.train(
    input_fn=training_input_fn,
    steps=steps_per_period
)
The program classifies a 10k-example MNIST dataset (~60 MB in memory) with Logistic Regression, while the same process takes only seconds and a bit of memory with sklearn's LogisticRegression. Could anyone please give advice on what the primary memory consumer is here or how I can probably trace the memory usage?
UPD. I carried out another experiment to compare feeding data to the model using placeholders and Dataset class. I implemented a custom logistic regression (as opposed to the one from Estimator) and found out that feeding data using Dataset:
ds = Dataset.from_tensor_slices((train_x, train_y))
ds = ds.batch(10).repeat(num_epochs)
ds = ds.shuffle(buffer_size=10000)
x,y = ds.make_one_shot_iterator().get_next()
...
sess.run(my_optimiser)
results in consumption of at least 5x as much memory during the training as with Placeholders:
x = tf.placeholder(tf.float32, (None,784), 'pixels')
y = tf.placeholder(tf.float32, (None), 'targets')
...
sess.run(my_optimiser, feed_dict={x:train_x, y:train_y})

Keras fit_generator using a lot of memory even with small batch sizes

Previously I manually trained my model using model.fit() inside a for loop to train it on small batches of data, due to memory constraints. The problem with this is that I can't have access to all previous histories through history.history, because it's like each time a new model is trained, and previous histories aren't stored anywhere.
When I use model.fit() with a batch size of 500, around 7 GB of my RAM fills up. I use Keras with the tensorflow-cpu backend.
But when I use a generator, even a batch size of 50 won't fit in memory, and it gets swapped to disk.
I'm performing classification, using 224*224 images, and I am trying to fine tune vgg face. I'm using vgg face implemented according to this link:
VGG-Face
I'm using ResNet and SeNet architectures, as described in the link.
I've previously shuffled my data and put aside 20% of it for testing.
My data (image addresses and labels) is stored in a list. 20% of my training data is used for validation. For example, if the batch size is 50, train_data_generator creates batches of size 40 from the first 80% of the training data, and val_data_generator creates batches of size 10 from the last 20%. I've written a class, and I perform training by creating an instance and invoking its train method. Here are the generator and training parts of my code, excluding model definitions:
def prepare_input_data(self, batch_addresses):
    image = []
    for j in range(len(batch_addresses)):
        img = cv2.imread(batch_addresses[j])
        img = cv2.resize(img, (224, 224))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = img - np.array([103.939, 116.779, 123.68])
        image.append(img)
    data = np.array(image)
    data = data.astype('float32')
    data /= 255
    return data

def train_data_generator(self, addresses, labels, batch_size):
    """Train data generator"""
    # Use the first 80% of the data for training.
    addresses = addresses[: int(0.8 * len(addresses))]
    labels = labels[: int(0.8 * len(labels))]
    total_data = len(addresses)
    while 1:
        # Integer division: range() requires an int in Python 3
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels

def val_data_generator(self, addresses, labels, batch_size):
    """Validation data generator"""
    # Use the last 20% of the data for validation.
    addresses = addresses[int(0.8 * len(addresses)):]
    labels = labels[int(0.8 * len(labels)):]
    total_data = len(addresses)
    image = []
    while 1:
        # Integer division: range() requires an int in Python 3
        for i in range(total_data // batch_size):
            batch_addresses = addresses[i * batch_size: (i + 1) * batch_size]
            batch_labels = labels[i * batch_size: (i + 1) * batch_size]
            data = self.prepare_input_data(batch_addresses)
            batch_labels = np_utils.to_categorical(batch_labels, self.nb_class)
            yield data, batch_labels

def train(self, label_interested_in):
    """Trains the model"""
    # Read training data from json file, and get addresses and labels
    addresses, labels = self.create_address_and_label(label_interested_in)
    batch_size = 50
    train_batch_size = 40
    val_batch_size = 10
    steps = int(len(addresses) / batch_size) + 1
    print(len(addresses), steps)
    # Perform training
    history = self.custom_vgg_model.fit_generator(
        self.train_data_generator(addresses, labels, train_batch_size),
        steps_per_epoch=steps, epochs=self.number_of_epochs,
        verbose=1, validation_data=self.val_data_generator(addresses, labels, val_batch_size),
        validation_steps=steps, initial_epoch=0)
Why am I seeing such high memory usage? Is it because of the way generators work in Keras? I read that generators prepare batches beforehand to speed up the training process by running in parallel with the training. Or am I doing something wrong?
As a side question, since there isn't a batch_size argument in fit_generator(), am I correct in assuming that data gets loaded into the model based on generators and gradient updates are performed after each training and validation batch is loaded?
Try workers=0.
With workers=0 the generator runs on the main thread, so no worker threads or processes are started. Normally, k workers fill a queue with generated batches ahead of time, up to the max_queue_size argument.
The purpose of that queue is to prepare generated data on the CPU while training is ongoing on the GPU, so that no time is lost and bottlenecks are avoided. It also means that up to max_queue_size prepared batches are held in memory at once.
For your needs, workers=0 should work.
For more detail, refer to:
keras fit_generator
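To make the queueing behaviour described above concrete, here is a small standard-library sketch of a bounded prefetch queue. It is an illustration only, not Keras's actual implementation; `batch_generator`, `prefetch`, and the sizes are hypothetical names chosen for this example. The point is that a worker keeps up to `max_queue_size` finished batches waiting in memory, so memory use grows with the queue size as well as with the batch size.

```python
import queue
import threading


def batch_generator(num_batches, batch_size):
    """Stand-in for a data generator: yields lists of `batch_size` integers."""
    for i in range(num_batches):
        yield [i] * batch_size


def prefetch(generator, max_queue_size):
    """Yield items from `generator`, prepared ahead of time by a worker thread.

    At most `max_queue_size` finished batches wait in the queue at once.
    """
    buffer = queue.Queue(maxsize=max_queue_size)
    done = object()  # sentinel marking the end of the generator

    def worker():
        for item in generator:
            buffer.put(item)  # blocks while the queue is full
        buffer.put(done)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is done:
            return
        yield item


consumed = list(prefetch(batch_generator(num_batches=5, batch_size=3),
                         max_queue_size=2))
```

With no background worker, as with workers=0, each batch is produced only when the training loop asks for it, so at most one prepared batch exists at a time.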

How to speed up batch preparation when using Estimators API combined with tf.data.Dataset

I'd like to speed up my training routine, which uses the Estimator API with an input_fn written using tf.data.Dataset.
My implementation takes 2 seconds to prepare a batch of data, then runs training on the GPU for 1 second, and then starts over preparing the next batch, which is really inefficient.
I'm looking for a way to prepare the batches asynchronously and upload them to the GPU to speed up the training, or alternatively for a way to cache datasets between invocations of input_fn (dataset.cache() doesn't seem to be a good choice, as the dataset has to be recreated on each input_fn invocation).
Here is a simplified version of my code:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.map(_read_wav, num_parallel_calls=num_map_threads)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(_post_process, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    dataset = dataset.repeat(epochs)  # epochs=None iterates over the training set forever
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels

train_input_fn = lambda: input_fn(train_files, train_labels, None)
eval_input_fn = lambda: input_fn(eval_files, eval_labels, 1)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=45000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
I've noticed that the Estimator API is under active development and in the master branch of tensorflow the input_fn can return datasets already, so maybe I'm asking too early and this feature isn't ready yet. But if so, please provide a ticket where this implementation can be tracked.
Using tf.data.Dataset.cache() is indeed not a good choice since it will cache the whole dataset into memory, which takes time and might overflow your memory.
The way to go is to use tf.data.Dataset.prefetch() at the end of your pipeline, which will always make sure that the data pipeline holds buffer_size elements. It is usually enough to have buffer_size = 1 at the end:
dataset = ...
dataset = dataset.batch(128)
dataset = dataset.prefetch(1) # prefetch one batch
As explained by @mrry in this answer, you can also try to increase the number of prefetched batches a bit.
Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
If you still have a slow input pipeline compared to your GPU computations, you need to increase the number of threads working in parallel using the num_parallel_calls argument of tf.data.Dataset.map().
A few points to add to Olivier's answer, mostly from this post:
Repeating before shuffling is slightly faster, at the cost of blurred epoch boundaries. This may be significant in rare cases, but I doubt it.
Shuffle before mapping. This reduces the memory footprint of your shuffle buffer, since it only needs to buffer the filenames rather than the file contents.
It makes more sense to me to apply the third map transform to the output of get_next() rather than to the dataset, though I'm not sure whether that affects speed much. You could also consider merging the other two map calls into one to reduce scheduling overhead.
Experiment with repeating before batching. It probably won't make a difference, or only a minor one. If you repeat before shuffling as mentioned above, you'll have to.
As mentioned by Olivier, use prefetch.
Code with modifications:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))

    def combined_map_fn(*args):
        # _read_wav returns (wav, label), which _post_process takes as two arguments
        return _post_process(*_read_wav(*args))

    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(1)
    iterator = dataset.make_one_shot_iterator()
    wavs, labels = iterator.get_next()
    features = {'wav': wavs}
    return features, labels

How can I shuffle a whole dataset with TensorFlow?

I currently use the following function for shuffling:
from tensorflow.contrib import data

def input_pipeline(filenames, batch_size):
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data.
    dataset = data.TextLineDataset(filenames)
    dataset = dataset.map(decode_func)
    dataset = dataset.shuffle(buffer_size=10000)  # Equivalent to min_after_dequeue=10000.
    dataset = dataset.batch(batch_size)
    # Return an *initializable* iterator over the dataset, which will allow us to
    # re-initialize it at the beginning of each epoch.
    return dataset.make_initializable_iterator()
But this only shuffles the data within a window of buffer_size elements, and the buffer is filled in the original order.
My data is enormous, so I cannot set buffer_size very large. Is there any other way to shuffle the whole dataset?
Currently the Dataset API has no support for shuffling a whole Dataset (greater than 10k examples). According to this thread, the common approach is:
1. Randomly shuffle the entire data once using a MapReduce/Spark/Beam/etc. job to create a set of roughly equal-sized files ("shards").
2. In each epoch:
a. Randomly shuffle the list of shard filenames, using Dataset.list_files(...).shuffle(num_shards).
b. Use dataset.interleave(lambda filename: tf.data.TextLineDataset(filename), cycle_length=N) to mix together records from N different shards.
c. Use dataset.shuffle(B) to shuffle the resulting dataset. Setting B might require some experimentation, but you will probably want to set it to some value larger than the number of records in a single shard.
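The per-epoch steps above might look roughly like this in code. This is a sketch under assumptions: it targets TensorFlow 2.1 or later (for `as_numpy_iterator` and eager iteration) rather than the `tf.contrib.data` API used in the question, and the shard files, `make_epoch_dataset`, and all sizes are hypothetical. The tiny text shards written here stand in for the output of the one-time offline shuffle job.

```python
import os
import tempfile

import tensorflow as tf

NUM_SHARDS = 4
RECORDS_PER_SHARD = 5

# Write small text shards (assumed to have been globally shuffled offline).
shard_dir = tempfile.mkdtemp()
for s in range(NUM_SHARDS):
    with open(os.path.join(shard_dir, "shard-%02d.txt" % s), "w") as f:
        for r in range(RECORDS_PER_SHARD):
            f.write("shard%d-record%d\n" % (s, r))


def make_epoch_dataset(pattern, num_shards, cycle_length, shuffle_buffer):
    # a. Shuffle the list of shard filenames.
    files = tf.data.Dataset.list_files(pattern, shuffle=False)
    files = files.shuffle(num_shards)
    # b. Mix records from `cycle_length` shards at a time.
    records = files.interleave(
        lambda filename: tf.data.TextLineDataset(filename),
        cycle_length=cycle_length)
    # c. Shuffle the interleaved records within a bounded buffer.
    return records.shuffle(shuffle_buffer)


dataset = make_epoch_dataset(
    os.path.join(shard_dir, "shard-*.txt"),
    num_shards=NUM_SHARDS,
    cycle_length=2,
    shuffle_buffer=RECORDS_PER_SHARD * 2)

# One pass over the dataset is one epoch; iterating again reshuffles.
epoch_records = [line.decode("utf-8") for line in dataset.as_numpy_iterator()]
```

Only `shuffle_buffer` records are held in memory at any time, while the shard-level shuffle and interleave supply most of the long-range mixing.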