What does the use_multiprocessing input argument in keras mode.fit do? - tensorflow

I am training an LSTM autoencoder model in python using Keras using only CPU.
I can see that there is an argument called use_multiprocessing in the fit function. Could you please explain in simple terms what does this argument do exactly. I read the explanation on tensorflow.org but I cannot understand from it if I set the parameter to true how would my model be impacted. I am looking for ways to speed up the training of my model and I am wondering if this parameter would help.

The use_multiprocessing (and workers and max_queue_size) parameters apply to the batch data generation. The clue in the documentation is this: "Used for generator or keras.utils.Sequence input only" [ref https://keras.io/api/models/model_training_apis/#fit-method]. Deep under the hood, keras uses an orderedenqueuer to wrap your input.
If use_multiprocessing is True and workers > 0, then keras will create multiple (number = workers) processes to run simultaneously and prepare batches from your generator/sequence. They will try to keep the queue of batches ready for training up to max_queue_size.
If use_multiprocessing is False and workers > 1, then keras will create multiple (number = workers) threads to simultaneously prepare batches, similar to above (but your input data object needs to be thread safe).
If your batch data generation is the bottleneck in your training process, this can speed things up a lot. I have found use_multiprocessing True to speed up batch data bound work linearly e.g. 2 workers = 2x as fast (though there is overhead to start the processes). For use_multiprocessing False and threads, I have found an unpredictable 0%-15% speed increase, without overhead.
Refer also this question with lots of details:
How to define max_queue_size, workers and use_multiprocessing in keras fit_generator()?

Related

Why does Keras accept a batch size option for model.evaluate?

Why does the evaluate function of the Keras API in Tensorflow accept a batch_size? To my knowledge, this parameter should only be relevant for managing how many samples we use per iteration during training. What influence does this choice have during model evaluation?
Batch size is mainly used in Sequence-based predictions or in Time series predictions.
Below are the cases where you have to use batch size while prediction.
In Time Series use cases it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions in order to predict the next step in the sequence.
For Stateful RNN it is required to provide a fixed batch size during prediction/evaluation where the output state of the current batch is used as the initial state for the next batch. They keep information from one batch to another batch.
If your model doesn't fall into these kinds of category technically you don't need to provide batch size as input during evaluating. Even if you provide batch size, it's how much data you are feeding at a time for GPU.

TPU terminology confusion

So I know how epochs, train steps, batch sizes and this kind of stuff are defined, but it is really hard to me to get my head wraped around the TPU terminology like train loops, iterations per loop and so on. I read this but Im still confused.
Also how can I benchmark the time for iterations per loop for example.
Any explanation would help me a lot there. Thanks!
As the other answers have described, iterations_per_loop is a tuning parameter that controls the amount of work done by the TPU before checking in with it again. A lower number lets you inspect results (and benchmark them) more often, and a higher number reduces the overhead due to synchronization.
This is no different from familiar network or file buffering techniques; changing its value affects performance, but not your final result. In contrast, ML hyperparameters like num_epochs, train_steps, or train_batch_size will change your result.
EDIT: Adding an illustration in pseudocode, below. Notionally, the training loop functions like this:
def process_on_TPU(examples, train_batch_size, iterations_per_loop):
# The TPU will run `iterations_per_loop` training iterations before returning to the host
for i in range(0, iterations_per_loop):
# on every iteration, the TPU will compute `train_batch_size` examples,
# calculating the gradient from every example in the given batch
compute(examples[i * train_batch_size : (i + 1) * train_batch_size])
# assume each entry in `example` is a single training example
for b in range(0, train_steps, train_batch_size * iterations_per_loop)
process_on_TPU(examples[b:b + train_batch_size * iterations_per_loop],
train_batch_size,
iterations_per_loop)
From this, it might appear that train_batch_size and iterations_per_loop are simply two different ways of accomplishing the same thing. However, this is not the case; train_batch_size affects the learning rate, since (at least in ResNet-50) the gradient is computed at each iteration from the average of the gradient of every example in the batch. Taking 50 steps per 50k examples will produce a different result from taking from 1k steps per 50k examples, since the latter case calculates the gradient much more often.
EDIT 2: Below is a way to visualize what's happening, with a racing metaphor. Think of the TPU as running a race that has a distance of train_steps examples, and its stride lets it cover a batch of examples per step. The race is on a track, which is shorter than the total race distance; the length of the lap is your total number of training examples, and every lap around the track is one epoch. You can think of iterations_per_loop as being the point where the TPU can stop at a "water station" of sorts where the training is temporarily paused for a variety of tasks (benchmarking, checkpointing, other housekeeping).
By "train loop", I'm assuming it's the same meaning as "training loop". The training loop is the one that iterates through each epoch in order to feed the model.
The iterations per loop is related to how Cloud TPU handles the training loop. In order to amortize the TPU launch cost, the model training step is wrapped in a tf.while_loop, such that one Session run actually runs many iterations for a single training loop.
Because of this, Cloud TPU runs a specified number of iterations of the training loop before returning to the host. Therefore, iterations_per_loop is how many iterations will run for one session.run call.
TPU literally means "Tensor Processing Unit", it's a hardware device used for computation in exactly the same way a GPU is used. The TPUs are effectively Google proprietary GPUs. There are technical differences under the hood of a GPU vs a TPU, mostly regarding speed and power consumption, and some issues of floating point precision, but you don't need to care about the details.
iterations_per_loop appears to be an effort to improve efficiency by loading the TPU with multiple training batches. There are often hardware bandwidth limitations when transferring large amounts of data from main memory to a GPU/TPU.
It appears that the code you reference is passing iterations_per_loop number of training batches to the TPU, then running iterations_per_loop number of training steps before pausing to do another data transfer from main memory to TPU memory.
I'm rather surprised to see that though, I would expect that asynchronous background data transfers would be possible by now.
My only disclaimer is that, while I'm proficient with Tensorflow, and have watched TPU evolution in papers and articles, I'm not directly experienced with the Google API or running on TPUs, so I'm inferring from what I read in the documentation you linked to.

How to efficiently shuffle a large tf.data.Dataset when using tf.estimator.train_and_evaluate?

The tf.estimator.train_and_evaluate documentation makes it clear that the input dataset must be properly shuffled for the training to see all examples:
Overfitting: In order to avoid overfitting, it is recommended to set up the training input_fn to shuffle the training data properly. It is also recommended to train the model a little longer, say multiple epochs, before performing evaluation, as the input pipeline starts from scratch for each training. It is particularly important for local training and evaluation.
In my application, I would like to uniformly sample examples from the full tf.data.Dataset with arbitrary evaluation frequency and shuffle()'s buffer size. Otherwise, the training can at most see the first:
(steps_per_second * eval_delay * batch_size) + buffer_size
elements, effectively discarding the rest. Is there an efficient way to work around that without loading the complete dataset in the system memory?
I considered sharding the dataset based on the buffer size, but if the evaluation does not occur frequently, it will iterate on the same shard multiple times (a repeat() closes the pipeline). Ideally, I would like to move to another shard after a complete iteration over the dataset, is that possible?
Thanks for any pointers!
A random sharding of the dataset can be implemented with this Dataset transformation:
def random_shard(shard_size, dataset_size):
num_shards = -(-dataset_size // shard_size) # Ceil division.
offsets = np.linspace(
0, dataset_size, num=num_shards, endpoint=False, dtype=np.int64)
def _random_shard(dataset):
sharded_dataset = tf.data.Dataset.from_tensor_slices(offsets)
sharded_dataset = sharded_dataset.shuffle(num_shards)
sharded_dataset = sharded_dataset.flat_map(
lambda offset: dataset.skip(offset).take(shard_size))
return sharded_dataset
return _random_shard
This requires to know the total dataset size in advance. However, if you implement a file-based sharding approach, you also iterate on the full dataset once so that is not a major issue.
Regarding efficiency, note that skip(offset) actually iterates on offset examples so a latency is to be expected if offset is large. Careful prefetching should help for this.

How does one move data to multiple GPU towers using Tensorflow's Dataset API

We are running multi GPU jobs on Tensorflow and evaluating a migration from the queue based model (using the string_input_producer interface) to the new Tensorflow Dataset API. The latter appears to offer an easier way to switch between Train and Validation, concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)
is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
lambda: val_iterator.get_next(),
lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []
for i in range(self.num_gpus):
with tf.variable_scope(tf.get_variable_scope(),reuse=(i > 0)):
with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
if i == validation_tower:
images, labels = next_batch
# Loss funcs snipped out
else:
images, labels = next_batch
# Loss funcs snipped out
The get_dataset function builds a dataset, sets a map function and a batch size. It also builds an iterator, but doesn't initialize it. Initialization of the iterator occurs before the session starts.
The is_validating boolean is supplied while the session is running, and every few steps we pass is_validating as True via a feed_dict to use the validation dataset
The question I have is:
Lets say I have 8 gpus, so we run training on 7 GPUs. Does the Iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPU's with the same data?
At present there are three main options, which have different usability and performance trade-offs:
In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path.
In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches. (By contrast, in your current code, the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.)
Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded). Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism
Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.

Tensorflow input pipeline

I have an input pipeline where samples are generated on fly. I use keras and custom ImageDataGenerator and corresponding Iterator to get samples in memory.
Under assumption that keras in my setup is using feed_dict (and that assumption is a question to me) I am thinking of speeding things up by switching to raw tensorflow + Dataset.from_generator().
Here I see that suggested solution for input pipelines that generate data on fly in the most recent Tensorflow is to use Dataset.from_generator().
Questions:
Does keras with Tensorflow backend use feed_dict method?
If I switch to raw tensorflow + Dataset.from_generator(my_sample_generator) will that cut feed_dict memory copy overhead and buy me performance?
During predict (evaluation) phase apart from batch_x, batch_y I have also opaque index vector from my generator output. That vector corresponds to sample ids in the batch_x. Does that mean that I'm stuck with feed_dict approach for predict phase because I need that extra batch_z output from iterator?
The new tf.contrib.data.Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".