I found that, in this program, TF Estimator paired with Dataset uses an unjustifiably large amount of memory (about 1 GB) and takes tens of minutes to run, even though the batch size is only 10 and the number of iterations is 100.
The code for model initialisation:
classifier = tf.estimator.LinearClassifier(
    feature_columns=construct_feature_columns(),
    n_classes=10,
    optimizer=my_optimizer,
    config=tf.estimator.RunConfig(keep_checkpoint_max=1)
)
Fit procedure invoked:
classifier.train(
    input_fn=training_input_fn,
    steps=steps_per_period
)
The program classifies a 10k-example MNIST dataset (~60 MB in memory) with logistic regression, while the same task takes only seconds and very little memory with sklearn's LogisticRegression. Could anyone advise what the primary memory consumer is here, or how I can trace the memory usage?
UPD. I carried out another experiment to compare feeding data to the model using placeholders versus the Dataset class. I implemented a custom logistic regression (as opposed to the one from Estimator) and found that feeding data using Dataset:
ds = Dataset.from_tensor_slices((train_x, train_y))
ds = ds.batch(10).repeat(num_epochs)
ds = ds.shuffle(buffer_size=10000)
x,y = ds.make_one_shot_iterator().get_next()
...
sess.run(my_optimiser)
results in at least 5x the memory consumption during training compared with placeholders:
x = tf.placeholder(tf.float32, (None,784), 'pixels')
y = tf.placeholder(tf.float32, (None), 'targets')
...
sess.run(my_optimiser, feed_dict={x:train_x, y:train_y})
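As for tracing the memory usage: one lightweight option (my own suggestion, not something from the original post) is the standard-library tracemalloc module. Note that it only sees Python-level allocations, so buffers allocated by TensorFlow's C++ runtime will not show up there:
import tracemalloc

tracemalloc.start()

# ... run classifier.train(...) or the custom training loop here ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)  # top 10 Python-level allocation sites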
I'm trying to understand how to use multiple GPUs to train a model on data too large for GPU memory. Using tf.distribute.MirroredStrategy seems to copy the full dataset to each GPU. What I'm hoping to do is send a subset of the full dataset to each GPU (2 or 4 GPUs) and use MirroredStrategy to reconcile parameter updates on each epoch.
MirroredStrategy.distribute_datasets_from_function() looks promising.
https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy#distribute_datasets_from_function
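For reference, here is a rough sketch (my own illustration, not from the post) of how distribute_datasets_from_function can be used: the function is called once per input pipeline (i.e. per worker), and batches are sized per replica, so each GPU only ever holds one per-replica batch at a time rather than the whole dataset. The toy arrays and model are placeholders:
import numpy as np
import tensorflow as tf

# Toy stand-ins for the real dictionaries of numpy arrays.
features = np.random.rand(10000, 32).astype(np.float32)
targets = np.random.rand(10000, 1).astype(np.float32)

strategy = tf.distribute.MirroredStrategy()
global_batch_size = 1024

def dataset_fn(input_context):
    # Shard across input pipelines (only relevant for multi-worker setups),
    # then batch per replica so each GPU receives only its own batch.
    ds = tf.data.Dataset.from_tensor_slices((features, targets))
    ds = ds.shard(input_context.num_input_pipelines, input_context.input_pipeline_id)
    per_replica_batch = input_context.get_per_replica_batch_size(global_batch_size)
    return ds.shuffle(8 * per_replica_batch).repeat().batch(per_replica_batch).prefetch(1)

dist_dataset = strategy.distribute_datasets_from_function(dataset_fn)

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(32,))])
    model.compile(optimizer='adam', loss='mse')

model.fit(dist_dataset, epochs=2,
          steps_per_epoch=features.shape[0] // global_batch_size)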
Problem details:
A fairly complicated multimodal NN with ~200k parameters combining many text, transactional, and structured inputs, with multiple regression and probabilistic outputs. I'm looking at moving development from a single GPU with 24 GB of memory to cloud compute with multiple 16 GB cards on a single node.
The inputs and targets are currently dictionaries of numpy arrays. I'm hoping for a toy example that converts those dictionaries into a distributed dataset and trains with a different subset of the full dataset assigned to each GPU.
I attempted this:
def build_model(**model_params):
    '''
    Builds a model from model_params.
    '''
    return tf.keras.Model(
        inputs=[MY_INPUT_TENSORS],
        outputs=[MY_OUTPUT_TENSORS])

distributed_strategy = tf.distribute.MirroredStrategy()
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(...)

train_model.fit(X_dict, y_dict)
This runs on a 50% sample of the data, but hits OOM on the full sample with 2 GPUs: the full dataset appears to be copied to each of the two 16 GB GPUs available. The same model runs with a 100% sample on a single 24 GB GPU.
Here's how I got it working with tf.data.Dataset.from_tensor_slices() and tf.distribute.MirroredStrategy.experimental_distribute_dataset():
# Data exists in the form of dictionaries of large numpy arrays
x_train, y_train, x_validation, y_validation = {}, {}, {}, {}

# Create TensorFlow datasets using CPU / system memory
with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
    valid = tf.data.Dataset.from_tensor_slices((x_validation, y_validation))

batch_size = 1024
epochs = 30

distributed_strategy = tf.distribute.MirroredStrategy()
num_gpu = distributed_strategy.num_replicas_in_sync

# Create a distributed dataset from the TensorFlow datasets.
# The data gets streamed to the GPUs, so shuffling, repetition per epoch, and batch
# size need to be specified manually.
train = train.shuffle(100 * batch_size).repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)
train_dist = distributed_strategy.experimental_distribute_dataset(train)
valid = valid.repeat(epochs).batch(num_gpu * batch_size, drop_remainder=True)

# Build and compile the model
with distributed_strategy.scope():
    train_model = build_model(**model_params)
    train_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss=losses,
        loss_weights=weights)

# Train the model. steps_per_epoch and validation_steps need to be specified.
train_model.fit(
    train_dist,
    validation_data=valid,
    epochs=epochs,
    steps_per_epoch=int(len(train) // epochs),
    validation_steps=int(len(valid) // epochs),
    use_multiprocessing=True,
    verbose=1,
)
TensorFlow version: 2.1.0
Model built using tf.keras
Graphics card: Nvidia GTX 1660 Ti 6 GB GDDR6
CPU: Intel i7 9th Gen
RAM: 16 GB DDR4
Storage: SSD (NVMe)
I wrote code to read audio files in batches in a multithreaded manner using tf.keras.utils.Sequence with multiple workers, but the issue with that code is that the CPU does not read the next set of audio batches concurrently while the GPU is training, so the GPU is only utilized up to 30 percent of its capacity (training an epoch takes around 25 minutes).
So I decided to move to tf.data.Dataset.from_generator and use the existing generator function to read the batches more efficiently. But that input pipeline performs even worse, taking 47 minutes to train an epoch. I have attached the code that I used to create the input pipeline. I read the file names and their categories from an Excel file, fed them to the generator, and created the pipeline.
Even after applying prefetch, the pipeline still performed poorly.
Since this is the first time I am using the tf.data API, I would appreciate some insight into whether I have made any mistakes.
This is my code to generate the batches.
# Function to read the audio files
def get_x(file):
    data = []
    for i in file:
        audio, fs = sf.read(i, dtype="float32")
        data.append(audio[::2])
    data = np.array(data, dtype=np.float32)
    data = np.expand_dims(data, axis=-1)
    return data

def data_generator(files, labels, batchsize):
    while True:
        start = 0
        end = batchsize
        while start < len(files):
            x = get_x(files[start:end])
            y = np.array(tf.keras.utils.to_categorical(labels[start:end], num_classes=2), dtype=np.float32)
            yield x, y
            start += batchsize
            end += batchsize

# Get the TensorFlow dataset object to generate batches
def tf_data_dataset(files, labels, batch_size):
    autotune = tf.data.experimental.AUTOTUNE
    dataset = tf.data.Dataset.from_generator(
        data_generator,
        output_types=(np.float32, np.float32),
        output_shapes=(tf.TensorShape([None, 16000, 1]),
                       tf.TensorShape([None, 2])),
        args=(files, labels, batch_size))
    dataset = dataset.prefetch(buffer_size=autotune)
    return dataset
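For comparison, here is a hedged sketch of an alternative pipeline that lets tf.data parallelize the per-file reads itself instead of reading a whole batch inside a single generator call; the use of tf.numpy_function and AUTOTUNE parallelism are my assumptions, not something from the question:
import numpy as np
import soundfile as sf
import tensorflow as tf

def read_one_file(path):
    # Runs in tf.data's thread pool; the path arrives as bytes via tf.numpy_function.
    audio, _ = sf.read(path.decode(), dtype="float32")
    return np.expand_dims(audio[::2], axis=-1).astype(np.float32)

def make_dataset(files, labels, batch_size):
    autotune = tf.data.experimental.AUTOTUNE
    ds = tf.data.Dataset.from_tensor_slices((files, labels))
    ds = ds.map(
        lambda f, l: (tf.numpy_function(read_one_file, [f], tf.float32),
                      tf.one_hot(l, depth=2)),
        num_parallel_calls=autotune)
    # Restore the static shape lost by numpy_function so downstream layers see it.
    ds = ds.map(lambda x, y: (tf.ensure_shape(x, [16000, 1]), y))
    ds = ds.batch(batch_size).prefetch(autotune)
    return ds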
I am using TensorFlow 1.9 to train on an image dataset that is too big to load from my hard drive into RAM. Therefore, I have split the dataset into two halves on my hard drive. I want to know the most efficient way to train on the entire dataset.
My GPU has 3 GB of memory, and my RAM has 32 GB of memory. The size of each half dataset is 20 GB. My hard drive has plenty of free space (over 1 TB).
My attempt is as follows. I create an initializable tf.Dataset, and then on every epoch, I initialize it twice: once for each of the halves of the dataset. In this way, each epoch sees the entire dataset, but only has to have half of it loaded in RAM at any one time.
However, this is very slow, because it takes a long time to load the data from my hard drive, and also quite a long time to initialize the dataset with this data each time.
Is there a more efficient way to do this?
I have tried training on each half of the dataset for multiple epochs before loading the other half of the dataset, which is much faster, but this gives much worse performance on the validation data. Presumably, this is because the model is overfitting on each half and then not generalising to the data in the other half.
In my code below, I create and save some test data, which is then loaded as described above. The time to load each half dataset is about 5 seconds, and the time to initialize the dataset with this data is about 1 second. This may only seem like small amounts, but it all adds up over multiple epochs. In fact, my computer spends almost as much time loading the data as it does actually training on the data.
import tensorflow as tf
import numpy as np
import time

# Create and save 2 datasets of test NumPy data
dataset_num_elements = 100000
element_dim = 10000
batch_size = 50
test_data = np.zeros([2, int(dataset_num_elements * 0.5), element_dim], dtype=np.float32)
np.savez('test_data_1.npz', x=test_data[0])
np.savez('test_data_2.npz', x=test_data[1])

# Create the TensorFlow dataset
data_placeholder = tf.placeholder(tf.float32, [int(dataset_num_elements * 0.5), element_dim])
dataset = tf.data.Dataset.from_tensor_slices(data_placeholder)
dataset = dataset.shuffle(buffer_size=dataset_num_elements)
dataset = dataset.repeat()
dataset = dataset.batch(batch_size=batch_size)
dataset = dataset.prefetch(1)
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
init_op = iterator.initializer
num_batches = int(dataset_num_elements / batch_size)

with tf.Session() as sess:
    while True:
        for dataset_section in range(2):
            # Load the data from the hard drive
            t1 = time.time()
            print('Loading')
            loaded_data = np.load('test_data_' + str(dataset_section + 1) + '.npz')
            x = loaded_data['x']
            print('Loaded')
            t2 = time.time()
            loading_time = t2 - t1
            print('Loading time = ' + str(loading_time))

            # Initialize the dataset with this loaded data
            t1 = time.time()
            sess.run(init_op, feed_dict={data_placeholder: x})
            t2 = time.time()
            initialization_time = t2 - t1
            print('Initialization time = ' + str(initialization_time))

            # Read the data in batches
            for i in range(num_batches):
                x = sess.run(next_element)
Feeding data with feed_dict is not an efficient way to input data. You can structure the input like this instead:
Create a filename dataset containing all the input file names; you can shuffle and repeat the dataset at this stage.
Map this dataset to the actual data: the map function reads, decodes, and transforms each image. Use multiple threads for the map.
Prefetch the data for training.
This is just one example; you can design your own pipeline (see the sketch below), keeping the following in mind:
keep any feeding as lightweight as possible
use multiple threads to read and preprocess
prefetch data for training
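A minimal sketch of such a pipeline, in TF 1.x style to match the question above (the file pattern, image size, and thread count are placeholders of mine):
import tensorflow as tf

# 1. Filename dataset: shuffle and repeat on lightweight strings only.
filenames = tf.data.Dataset.list_files('train_images/*.jpg')
filenames = filenames.shuffle(buffer_size=10000).repeat()

# 2. Map filenames to decoded, transformed images using multiple threads.
def parse_image(path):
    image = tf.read_file(path)
    image = tf.image.decode_jpeg(image, channels=3)
    image = tf.image.resize_images(image, [224, 224]) / 255.0
    return image

dataset = filenames.map(parse_image, num_parallel_calls=8)
dataset = dataset.batch(50)

# 3. Prefetch so the next batch is prepared while the current one trains.
dataset = dataset.prefetch(1)

iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()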
I'd like to speed up my training routine that uses the Estimator API with an input_fn written using tf.data.Dataset.
My implementation takes 2 seconds to prepare a batch of data, then runs training on the GPU for 1 second, and then starts over preparing the next batch, which is really inefficient.
I'm looking for a way to prepare the batches asynchronously and upload them to the GPU to speed up the training, or alternatively for a way to cache datasets between invocations of input_fn (dataset.cache() doesn't seem to be a good choice, as the dataset has to be recreated on each input_fn invocation).
Here is a simplified version of my code:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.map(_read_wav, num_parallel_calls=num_map_threads)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(_post_process, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    dataset = dataset.repeat(epochs)  # to iterate over the training set forever
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
train_input_fn = lambda : input_fn(train_files, train_labels, None)
eval_input_fn = lambda : input_fn(eval_files, eval_labels, 1)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=45000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
I've noticed that the Estimator API is under active development and that in the master branch of TensorFlow input_fn can already return a dataset, so maybe I'm asking too early and this feature isn't ready yet. If so, please point me to a ticket where this work can be tracked.
Using tf.data.Dataset.cache() is indeed not a good choice since it will cache the whole dataset into memory, which takes time and might overflow your memory.
The way to go is to use tf.data.Dataset.prefetch() at the end of your pipeline, which will always make sure that the data pipeline holds buffer_size elements. It is usually enough to have buffer_size = 1 at the end:
dataset = ...
dataset = dataset.batch(128)
dataset = dataset.prefetch(1) # prefetch one batch
As explained by @mrry in this answer, you can also try to increase the number of prefetched batches a bit.
Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
If you still have a slow input pipeline compared to your GPU computations, you need to increase the number of threads working in parallel using the num_parallel_calls argument of tf.data.Dataset.map().
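A tiny self-contained illustration of that knob (the map function and values here are placeholders, not from the answer above):
import tensorflow as tf

def parse_fn(x):
    # Stand-in for your real per-element preprocessing (e.g. _read_wav / _post_process).
    return tf.cast(x, tf.float32) / 255.0

dataset = tf.data.Dataset.range(10000)
dataset = dataset.map(parse_fn, num_parallel_calls=8)  # process up to 8 elements in parallel
dataset = dataset.batch(128)
dataset = dataset.prefetch(1)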
A few points to add to Olivier's answer, mostly from this post:
repeat before shuffle is slightly faster, at the downside of blurred epoch boundaries. This may be significant in rare cases, but I doubt it.
shuffle before mapping - this reduces the memory foot print of your shuffle buffer size, since it only needs to buffer the filenames rather than the file contents.
it makes more sense to me to apply the third map transform to the output of get_next() rather than the dataset - not sure if that affects speed much. You could also consider putting both other map calls in the same one to reduce scheduling issues.
experiment with repeat before batching. Probably won't make a difference, but might be minor. If you repeat before shuffle as mentioned above you'll have to.
as mentioned by Olivier, use prefetch.
Code with modifications:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))

    def combined_map_fn(*args):
        return _post_process(_read_wav(*args))

    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(1)
    iterator = dataset.make_one_shot_iterator()
    wavs, labels = iterator.get_next()
    features = {'wav': wavs}
    return features, labels
Hi, I don't understand the Keras fit_generator docs.
I hope my confusion is rational.
There is a batch_size and also the concept of training in batches. Using model.fit(), I specify a batch_size of 128.
To me this means that my dataset will be fed in 128 samples at a time, thereby greatly alleviating memory. It should allow a 100-million-sample dataset to be trained as long as I've got the time to wait. After all, Keras is only "working with" 128 samples at a time. Right?
But I highly suspect that specifying the batch_size alone doesn't do what I want whatsoever. Tons of memory is still being used. For my goals I need to train in batches of 128 examples each.
So I am guessing this is what fit_generator does. I really want to ask: why doesn't batch_size actually work as its name suggests?
More importantly, if fit_generator is needed, where do I specify the batch_size? The docs say to loop indefinitely.
A generator loops over every row once. How do I yield 128 samples at a time, remember where I last stopped, and resume from there the next time Keras asks for the next batch (which would start at row 129 after the first batch is done)?
You will need to handle the batch size somehow inside the generator. Here is an example to generate random batches:
import numpy as np

data = np.arange(100)
data_lab = data % 2
wholeData = np.array([data, data_lab])
wholeData = wholeData.T

def data_generator(all_data, batch_size=20):
    while True:
        idx = np.random.randint(len(all_data), size=batch_size)
        # Assuming the last column contains labels
        batch_x = all_data[idx, :-1]
        batch_y = all_data[idx, -1]
        # Return a tuple of (Xs, Ys) to feed the model
        yield (batch_x, batch_y)

# Peek at one batch (the generator itself loops forever)
print(next(data_generator(wholeData)))
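To hook a generator like this into Keras (a hedged sketch, not part of the original answer), the batch size is handled entirely inside the generator and only steps_per_epoch is passed to fit_generator; the toy model matches the (batch_size, 1) features yielded above:
from tensorflow import keras

model = keras.Sequential([keras.layers.Dense(1, activation='sigmoid', input_shape=(1,))])
model.compile(optimizer='adam', loss='binary_crossentropy')

batch_size = 20
model.fit_generator(
    data_generator(wholeData, batch_size=batch_size),
    steps_per_epoch=len(wholeData) // batch_size,  # batches that make up one epoch
    epochs=2)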
First, Keras's batch_size does work as intended. If you are working on a GPU, you should know that the model itself can be very heavy with Keras, especially if you are using recurrent cells. If you are working on a CPU, the whole program is loaded into memory, so the batch size won't have much of an impact on memory usage. If you are using fit(), the whole dataset is probably loaded into memory, and Keras produces batches at every step. It's very difficult to predict the amount of memory that will be used.
As for the fit_generator() method, you should build a Python generator function (using yield instead of return), yielding one batch at every step. The yield should be inside an infinite loop (we often use while True: ...).
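For the sequential case asked about above (yielding 128 rows at a time and resuming where the last batch stopped), a minimal sketch might look like this; the array names are illustrative, not from the original answer:
import numpy as np

def sequential_generator(x, y, batch_size=128):
    # Yield consecutive (x, y) batches forever, wrapping around after each full pass.
    num_samples = len(x)
    while True:  # Keras expects the generator to loop indefinitely
        for start in range(0, num_samples, batch_size):
            end = start + batch_size
            yield x[start:end], y[start:end]

# Example usage with toy data
x = np.random.rand(1000, 20)
y = np.random.randint(2, size=1000)
gen = sequential_generator(x, y, batch_size=128)
first_x, first_y = next(gen)    # rows 0-127
second_x, second_y = next(gen)  # rows 128-255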
Do you have some code to illustrate your problem?