How does one move data to multiple GPU towers using TensorFlow's Dataset API

We are running multi GPU jobs on Tensorflow and evaluating a migration from the queue based model (using the string_input_producer interface) to the new Tensorflow Dataset API. The latter appears to offer an easier way to switch between Train and Validation, concurrently.
A snippet of code below shows how we are doing this.
train_dataset, train_iterator = get_dataset(train_files, batch_size, epochs)
val_dataset, val_iterator = get_dataset(val_files, batch_size, epochs)
is_validating = tf.placeholder(dtype=bool, shape=())
next_batch = tf.cond(is_validating,
                     lambda: val_iterator.get_next(),
                     lambda: train_iterator.get_next())
validation_tower = self.num_gpus - 1
tower_grads = []
for i in range(self.num_gpus):
    with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
        with tf.device('/gpu:%d' % i), tf.name_scope('%s_%d' % ('gpu_', i)) as scope:
            if i == validation_tower:
                images, labels = next_batch
                # Loss funcs snipped out
            else:
                images, labels = next_batch
                # Loss funcs snipped out
The get_dataset function builds a dataset, sets a map function and a batch size. It also builds an iterator, but doesn't initialize it. Initialization of the iterator occurs before the session starts.
The is_validating boolean is supplied while the session is running, and every few steps we pass is_validating as True via a feed_dict to use the validation dataset.
The question I have is:
Let's say I have 8 GPUs, so we run training on 7 GPUs. Does the Iterator advance from the same point for each of these 7 GPUs, hence supplying all 7 GPUs with the same data?

At present there are three main options, which have different usability and performance trade-offs:
1) In the Dataset.batch() transform, create a single large batch containing examples for all of your GPUs. Then use tf.split(..., self.num_gpus) on the output of Iterator.get_next() to create sub-batches for each GPU. This is probably the easiest approach, but it does place the splitting on the critical path.
2) In the Dataset.batch() transform, create a mini-batch that is sized for a single GPU. Then call Iterator.get_next() once per GPU to get multiple different batches. (By contrast, in your current code, the same value of next_batch is sent to each GPU, which is probably not what you wanted to happen.)
3) Create multiple iterators, one per GPU. Shard the data using Dataset.shard() early in the pipeline (e.g. on the list of files if your dataset is sharded). Note that this approach will consume more resources on the host, so you may need to dial down any buffer sizes and/or degrees of parallelism.
Note that the current tf.data pipelines run on the CPU only, and an important aspect of an efficient pipeline is staging your training input to the GPU while the previous step is still running. See the TensorFlow CNN benchmarks for example code that shows how to stage data to GPUs efficiently. We are currently working on adding this support to the tf.data API directly.
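To make the difference between the first two options and the code in the question concrete, here is a framework-free sketch in plain Python. It does not use TensorFlow: a generator stands in for the iterator, and NUM_GPUS, PER_GPU_BATCH, batches() and split_batch() are hypothetical names chosen for illustration. Reusing one fetched value gives every tower the same data, fetching once per tower gives each tower a different batch, and splitting one large batch gives each tower a different slice.

```python
NUM_GPUS = 4           # hypothetical tower count
PER_GPU_BATCH = 2      # hypothetical per-tower batch size


def batches(examples, batch_size):
    """Yield consecutive full batches; stands in for Iterator.get_next()."""
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[start:start + batch_size]


def split_batch(batch, num_splits):
    """Split one large batch into equal sub-batches (the role tf.split plays)."""
    size = len(batch) // num_splits
    return [batch[i * size:(i + 1) * size] for i in range(num_splits)]


examples = list(range(16))

# Pattern in the question: fetch once, reuse the same value for every tower.
it = batches(examples, PER_GPU_BATCH)
next_batch = next(it)
shared = [next_batch for _ in range(NUM_GPUS)]

# Option 2: fetch once per tower, so each tower sees a different batch.
it = batches(examples, PER_GPU_BATCH)
per_tower = [next(it) for _ in range(NUM_GPUS)]

# Option 1: one large batch, split into one sub-batch per tower.
it = batches(examples, PER_GPU_BATCH * NUM_GPUS)
split = split_batch(next(it), NUM_GPUS)
```

In this toy setup the second and third patterns hand out the same four distinct sub-batches, while the first pattern repeats one sub-batch four times.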

Related

No cache file written in TensorFlow dataset

I am trying to manage a large image dataset, that does not fit in the memory, while requiring some specific calculation. Currently, my code looks like this:
files = [str(f) for f in self.files]
labels = self.categories
batch_size= 32
dataset = tf.data.Dataset.from_generator(
    lambda: zip(files, labels),
    output_types=(tf.string, tf.uint8),
    output_shapes=(tf.TensorShape([]), tf.TensorShape([]))
)
dataset = dataset.map(
    lambda x, y: tf.py_function(_parser, [x, y, category_count], [tf.float32, tf.uint8]),
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
    deterministic=False)
dataset.cache(filename='/tmp/dataset.tmp')
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset = dataset.shuffle(buffer_size=10*batch_size, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
if mode == tf.estimator.ModeKeys.TRAIN:
    dataset.repeat(None)
else:
    dataset.repeat(1)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
The _parser() function opens an image file, does a bunch of transformations, and returns a tensor and a 1-hot encoded vector. The caching step does not seem to work properly, however:
There is no significant improvement in computation time between the 1st epoch and the following ones
No cache file is created during the process, although the swap partition is almost full (~90%)
Does the cache() function create a file only when both the memory and the swap partition are full? Furthermore, I expect to read only batch_size files at a time. However, it seems that all files are read at once during the mapping step. Should I consider using interleave() combined with from_generator() instead? Or should I batch the files first, then map them?
Note that cache() is best suited to small datasets. If the dataset is large (as in your case), RAM will not be sufficient to hold its contents. Either increase the available RAM or use another method to speed up training.
Another reason for slow training is the preprocessing stage inside map().
map() applies a transformation to each element, whereas apply() applies a transformation to the dataset as a whole.
You can use interleave() while keeping the same order: map() first, then batch().
You are already parallelizing the map by setting num_parallel_calls to tf.data.experimental.AUTOTUNE, which makes the best use of whatever resources are available.
You can also normalize your input data and then cache it; if it still does not fit into memory, it is better not to cache a large dataset.
You can follow these performance tips from TensorFlow.
If you have multiple workers/devices it will help you to speed up the training.
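Separately from memory limits, one detail in the question's snippet may be worth checking: tf.data transformations such as cache() and repeat() return a new dataset rather than modifying the one they are called on, so a call like dataset.cache(filename=...) whose result is not assigned does not become part of the pipeline that is later iterated. The toy class below models that behaviour in plain Python. It is not tf.data; ToyDataset and its methods are hypothetical stand-ins for illustration only.

```python
class ToyDataset:
    """Hypothetical stand-in for an immutable, chainable dataset."""

    def __init__(self, ops=()):
        self.ops = tuple(ops)

    def _with(self, op):
        # Each transformation returns a NEW object; self is left unchanged.
        return ToyDataset(self.ops + (op,))

    def map(self):
        return self._with("map")

    def cache(self):
        return self._with("cache")

    def batch(self):
        return self._with("batch")


# Result discarded: the pipeline never gains a cache stage.
dropped = ToyDataset().map()
dropped.cache()
dropped = dropped.batch()

# Result reassigned: the cache stage is part of the pipeline.
kept = ToyDataset().map()
kept = kept.cache()
kept = kept.batch()
```

If the same pattern applies to the real pipeline, writing dataset = dataset.cache(...) and dataset = dataset.repeat(...) would be the corresponding change.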
[Figure: prefetching with multithreaded loading and preprocessing.]

Does tensorflow Estimator take different batches for workers when MirroredStrategy is used?

I am using GANEstimator with MirroredStrategy to work on multiple GPUs of a single instance. input_fn in my case is a tf.data.Dataset with the following settings:
dataset = dataset.repeat()
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(self.batch_size, drop_remainder=True)
dataset = dataset.prefetch(100)
The reason I am asking is this: do I need to specify something like dataset.shard() manually to have different data passed to the workers? I am digging through the code of Estimator and MirroredStrategy, but it is unclear to me what is going on. Additional confusion comes from the description of the distribution strategies:
MirroredStrategy: This does in-graph replication with synchronous
training on many GPUs on one machine. Essentially, we create copies of all
variables in the model's layers on each device. We then use all-reduce
to combine gradients across the devices before applying them
to the variables to keep them in sync.
CollectiveAllReduceStrategy: This is a version of MirroredStrategy
for multi-worker training.
So does MirroredStrategy use only one worker? I don't understand it. I need to specify a batch size equal to the capacity of one tower, otherwise I get OOM. Can someone please point me to the code and explain how such a simple setup works with batches:
def create_dataset():
    ...
    dataset = dataset.repeat()
    dataset = dataset.shuffle(buffer_size=100)
    dataset = dataset.batch(self.batch_size, drop_remainder=True)
    dataset = dataset.prefetch(100)
    return dataset
NUM_GPUS = 4
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)
optimizer_d = tf.train.RMSPropOptimizer(learning_rate=0.01, use_locking=True)
config = tf.estimator.RunConfig(save_checkpoints_steps=100,
                                save_summary_steps=1, keep_checkpoint_max=50,
                                train_distribute=strategy)
# I have more hooks here, just simplified to show
def get_hooks_fn(GANTrainOps):
    disjoint_train_hook_func = tfgan.get_sequential_train_hooks(
        train_steps=tfgan.GANTrainSteps(10, 1)
    ) # g steps, d steps
    disjoint_train_hooks = disjoint_train_hook_func(GANTrainOps)
    return [update_hook, summary_hook] + disjoint_train_hooks
# Create GAN estimator.
gan_estimator = tfgan.estimator.GANEstimator(
    model_dir = '/data/checkpoints/estimator_model',
    generator_fn = generator_fn,
    discriminator_fn = discriminator_fn,
    generator_loss_fn = generator_loss_fn,
    discriminator_loss_fn = discriminator_loss_fn,
    generator_optimizer = optimizer,
    discriminator_optimizer = optimizer_d,
    use_loss_summaries=True,
    config=config,
    get_hooks_fn=get_hooks_fn)
gan_estimator.train(input_fn=create_dataset, steps=10000)
Thanks!
The code of MirroredStrategy contains:
1) Weird wording:
The multi-worker version of this class maps one replica to one device on a
worker. It mirrors all model variables on all replicas. For example, if you
have two workers and each worker has 4 GPUs, it will create 8 copies of
the model variables on these 8 GPUs. Then like in MirroredStrategy(???), each
replica performs their computation with their own copy of variables unless in
cross-replica model where variable or tensor reduction happens.
2)
auto_shard_dataset: whether to auto-shard the dataset when there are
multiple workers.
This parameter is False by default.
EDIT:
So far I found that tf.estimator.train() after some time points to what seems to be strategy.make_input_fn_iterator():
def _get_iterator_from_input_fn(self, input_fn, mode, distribution=None):
    if distribution is not None:
        iterator = distribution.make_input_fn_iterator(
            lambda _: self._call_input_fn(input_fn, mode))
        input_hooks = [
            estimator_util.DistributedIteratorInitializerHook(iterator)]
    else:
        result = self._call_input_fn(input_fn, mode)
        iterator = result.make_initializable_iterator()
        input_hooks = [estimator_util._DatasetInitializerHook(iterator)]
    return iterator, input_hooks
make_input_fn_iterator()
But it was removed from the code of MirroredStrategy and is no longer there! I don't understand how it works and where the dataset is actually split.
EDIT2: I can't find line make_input_fn_iterator in my tensorflow 1.12.0 distribution with grep. Seems like it's totally absent in the code.
Ok, after spending some time investigating github, I found that it is already different from my tf 1.12.0. So, going down in the local files of 1.12.0 gave me:
GANEstimator inherits tf.python.estimator.Estimator
Estimator.init():
# The distribute field contains an instance of DistributionStrategy.
self._train_distribution = self._config.train_distribute
Then the path down is:
tf.contrib.gan.GANEstimator -> tf.python.estimator.Estimator.train() -->
tf.python.estimator.Estimator._train_model(input_fn, hooks, saving_listeners) -->
._train_model_distributed(input_fn, hooks, saving_listeners) -->
._get_iterator_from_input_fn(input_fn, model_fn_lib.ModeKeys.TRAIN, self._train_distribution) -->
distribution.distribute_dataset(lambda: self._call_input_fn(input_fn, mode))
which in my case calls MirroredStrategy.distribute_dataset():
def distribute_dataset(self, dataset_fn):
    if self._cluster_spec:
        return values.MultiWorkerDataset(
            partial(self._call_dataset_fn, dataset_fn), self._worker_device_map,
            self._prefetch_on_device, self._auto_shard_dataset)
    else:
        return values.PerDeviceDataset(
            self._call_dataset_fn(dataset_fn), self._devices,
            self._prefetch_on_device)
tensorflow/python/training/distribute.py:
def _call_dataset_fn(self, dataset_fn):
    result = dataset_fn()
    if not isinstance(result, dataset_ops.Dataset):
        raise ValueError(
            "dataset_fn() must return a tf.data.Dataset when using a "
            "DistributionStrategy.")
    return result
I assume PerDeviceDataset is used, so finally I find these two classes in values.py:
class PerDeviceDataset(object):
    """Like `tf.data.Dataset` split devices, producing `PerDevice` data."""
    def __init__(self, dataset, devices, prefetch_on_device=None):
        self._devices = devices
        # Default to using prefetching in graph mode, unless specified.
        # TODO(priyag): Enable prefetching in eager mode.
        self._prefetch_on_device = prefetch_on_device
        if self._prefetch_on_device is None:
            self._prefetch_on_device = not context.executing_eagerly()
        assert not (self._prefetch_on_device and context.executing_eagerly()), (
            "Prefetching is only supported in graph mode currently")
        if self._prefetch_on_device:
            self._dataset = dataset.apply(
                prefetching_ops_v2.prefetch_to_devices(self._devices))
        else:
            # TODO(priyag): If dropping remainder is not appropriate, find another
            # approach to distributing the dataset when not possible to divide evenly.
            # Possibly not an issue when we start using PartitionedDataset.
            self._dataset = dataset.batch(len(devices), drop_remainder=True)
    def make_one_shot_iterator(self):
        """Get a one time use iterator for the distributed PerDeviceDataset."""
        dataset_iterator = self._dataset.make_one_shot_iterator()
        return PerDeviceDataIterator(dataset_iterator, self._devices,
                                     self._prefetch_on_device)
    def make_initializable_iterator(self):
        """Get an initializable iterator for the distributed PerDeviceDataset."""
        dataset_iterator = self._dataset.make_initializable_iterator()
        return PerDeviceDataIterator(dataset_iterator, self._devices,
                                     self._prefetch_on_device)
class PerDeviceDataIterator(object):
    """An iterator (like `tf.data.Iterator`) into a `PerDeviceDataset`."""
    def __init__(self, iterator, devices, prefetch_on_device=None):
        self._iterator = iterator
        self._devices = devices
        self._prefetch_on_device = prefetch_on_device
    @property
    def initializer(self):
        return self._iterator.initializer
    def get_next(self, name=None):
        """Scatter the input across devices."""
        if self._prefetch_on_device:
            data_list = self._iterator.get_next(name=name)
            index = dict(zip(self._devices, data_list))
        else:
            batch = self._iterator.get_next(name=name)
            index = {}
            def get_ith(i):
                return lambda x: x[i]
            for i, d in enumerate(self._devices):
                index[d] = nest.map_structure(get_ith(i), batch)
                if context.executing_eagerly():
                    with ops.device(d):
                        index[d] = nest.map_structure(array_ops.identity, index[d])
        return regroup(index)
So, as far as I understand, my dataset_fn() function is first called just to obtain the dataset object, and then a batch whose size equals the number of GPUs is applied on top of it. The elements of this batch, which must be the actual batches defined in my dataset initialization inside dataset_fn(), are assigned to different devices.
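The non-prefetch branch of the code quoted above can be modelled in plain Python to check this reading. This is only a sketch without TensorFlow: batch(), scatter() and the device names are hypothetical helpers. The already-batched dataset is batched again by the number of devices with the remainder dropped, and element i of each group goes to device i.

```python
def batch(elements, size, drop_remainder=True):
    """Group consecutive elements, in the spirit of Dataset.batch()."""
    groups = [elements[i:i + size] for i in range(0, len(elements), size)]
    if drop_remainder and groups and len(groups[-1]) < size:
        groups.pop()
    return groups


def scatter(group, devices):
    """Give element i of a group to device i."""
    return {device: group[i] for i, device in enumerate(devices)}


devices = ["/gpu:0", "/gpu:1"]             # hypothetical device list
examples = list(range(10))

user_batches = batch(examples, 2)           # batches built inside input_fn
device_groups = batch(user_batches, len(devices))
step0 = scatter(device_groups[0], devices)
```

With five user batches and two devices, the fifth batch is dropped as a remainder, and each device receives one full user-defined batch per step.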
I'll provide some clarification in case it helps, but really not sure if that's your point.
does MirroredStrategy use only one worker?
Yes. MirroredStrategy is intended to work only on one Worker (a.k.a one node, one computer, ...)
I need to specify batch size equal to the capacity of one tower
No. You need to multiply the batch size by the number of towers.
Note: For reference, a tower (also called a replica) is a copy of the model; there is one tower per GPU.
From this Keras tutorial, here is how to simply calculate the batch size:
BATCH_SIZE_PER_REPLICA = 64
BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
In that case, the batch size per GPU is 64, which is then multiplied by the number of GPUs.
Why multiply by the number of GPUs?
Because when the loss and the gradient are computed, they are divided by the total (global) batch size, not by the per-GPU batch size.
Weird wording:
This compares MirroredStrategy with the multi-worker strategy. In the case of a cluster, your tower will be replicated on every worker (e.g. 2 nodes in this example). Each worker is responsible for distributing the model to its GPUs (e.g. 4 GPUs in that case). In that example, you will have 8 copies of your model.
[...] Then like in MirroredStrategy(???), each replica performs their computation with their own copy of variables [...]
Whether you use multiple workers or a single worker, each GPU (or replica) computes its model independently and syncs afterward.
I guess they mention that "copy of variables", because there is another distributed computing topology with a Parameter Server (ps) where the ps will gather the weights of all replicas, sum it, and redistribute it to all replicas for the next round.

How to speed up batch preparation when using Estimators API combined with tf.data.Dataset

I'd like to speed up my training routine that uses the Estimator API with an input_fn written using tf.data.Dataset.
My implementation takes 2 seconds to prepare a batch of data, then runs training on the GPU for 1 second, and then starts over preparing the next batch, which is really inefficient.
I'm looking for a way to prepare the batches asynchronously and upload them to the GPU to speed up the training. Alternatively, I'm looking for a way to cache datasets between invocations of input_fn (dataset.cache() doesn't seem to be a good choice, as the dataset has to be recreated on each input_fn invocation).
Here is a simplified version of my code:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.map(_read_wav, num_parallel_calls=num_map_threads)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    dataset = dataset.map(_post_process, num_parallel_calls=num_map_threads)
    dataset = dataset.map(lambda wav, label: ({'wav': wav}, label))
    dataset = dataset.batch(128)
    dataset = dataset.repeat(epochs) # to iterate over the training set forever
    iterator = dataset.make_one_shot_iterator()
    features, labels = iterator.get_next()
    return features, labels
train_input_fn = lambda : input_fn(train_files, train_labels, None)
eval_input_fn = lambda : input_fn(eval_files, eval_labels, 1)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=45000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
I've noticed that the Estimator API is under active development and in the master branch of tensorflow the input_fn can return datasets already, so maybe I'm asking too early and this feature isn't ready yet. But if so, please provide a ticket where this implementation can be tracked.
Using tf.data.Dataset.cache() is indeed not a good choice since it will cache the whole dataset into memory, which takes time and might overflow your memory.
The way to go is to use tf.data.Dataset.prefetch() at the end of your pipeline, which will always make sure that the data pipeline holds buffer_size elements. It is usually enough to have buffer_size = 1 at the end:
dataset = ...
dataset = dataset.batch(128)
dataset = dataset.prefetch(1) # prefetch one batch
As explained by @mrry in this answer, you can also try to increase the number of prefetched batches a bit.
Typically it is most useful to add a small prefetch buffer (with perhaps just a single element) at the very end of the pipeline, but more complex pipelines can benefit from additional prefetching, especially when the time to produce a single element can vary.
If you still have a slow input pipeline compared to your GPU computations, you need to increase the number of threads working in parallel using the num_parallel_calls argument of tf.data.Dataset.map().
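The idea behind prefetching can be sketched without TensorFlow: a background producer keeps a small buffer filled, so the next batch is being prepared while the consumer works on the current one. The code below is a plain-Python model of that idea using a thread and a bounded queue. It is not how tf.data implements prefetch(), and prefetch() and slow_batches() here are hypothetical helpers.

```python
import queue
import threading


def prefetch(source, buffer_size=1):
    """Yield items from `source` while a background thread fills a small buffer."""
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the source

    def producer():
        for item in source:
            buf.put(item)      # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()
        if item is done:
            return
        yield item


def slow_batches(n):
    """Hypothetical stand-in for an expensive input pipeline."""
    for i in range(n):
        yield [i, i + 1]


consumed = [b for b in prefetch(slow_batches(5), buffer_size=1)]
```

The order of the batches is preserved; only the timing changes, because production and consumption overlap.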
A few points to add to Olivier's answer, mostly from this post:
repeat before shuffle is slightly faster, at the downside of blurred epoch boundaries. This may be significant in rare cases, but I doubt it.
shuffle before mapping - this reduces the memory footprint of your shuffle buffer, since it only needs to buffer the filenames rather than the file contents.
it makes more sense to me to apply the third map transform to the output of get_next() rather than the dataset - not sure if that affects speed much. You could also consider putting both other map calls in the same one to reduce scheduling issues.
experiment with repeat before batching. Probably won't make a difference, but might be minor. If you repeat before shuffle as mentioned above you'll have to.
as mentioned by Olivier, use prefetch.
Code with modifications:
def input_fn(filenames, labels, epochs):
    dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
    dataset = dataset.repeat(epochs)
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(labels))
    def combined_map_fn(*args):
        # _read_wav returns (wav, label), which is unpacked into _post_process
        return _post_process(*_read_wav(*args))
    dataset = dataset.map(combined_map_fn, num_parallel_calls=num_map_threads)
    dataset = dataset.batch(128)
    dataset = dataset.prefetch(1)
    iterator = dataset.make_one_shot_iterator()
    wavs, labels = iterator.get_next()
    features = {'wav': wavs}
    return features, labels

Is it possible to loop through all minibatches in a single tensorflow op using dataset/iterators?

I'm working with tf.data.dataset/iterator mechanism and trying to improve data loading performance. It occurred to me that offloading the entire minibatch loop from Python might help. My data is small enough that storing on CPU or GPU is no problem.
So, Is it possible to loop an optimizer node over a full minibatched epoch within a call to session.run?
The tensor returned by iterator.get_next() is only incremented once per session.run, which would seem to make it impossible to iterate through a dataset of minibatches... but if it could be done, my CPU would only have to touch the Python thread once per epoch.
UPDATE: @muskrat's suggestion to use tf.slice can be used for this purpose. See my subsequent non-answer with a schematic implementation of this using tf.while_loop. However, the question is whether this can be accomplished using dataset/iterators... and I'd still like to know.
From the description it seems that you already have the dataset preloaded as a constant on CPU/GPU, like at this example. That's certainly the first step.
Second, I suggest using tf.slice() to replicate the effect of the minibatch operation. In other words, just manually slice minibatches out of the preloaded constant (your dataset), and you should get the desired behavior. See for example the slice docs or this related post.
If that's not enough detail, please edit your question to include a code example (with mnist or something) and I can give more details.
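As a rough illustration of the slicing idea, here is a plain-Python sketch that uses list slicing in place of tf.slice. The function name and data are hypothetical. A counter selects consecutive windows of the preloaded data, and the loop continues only while another complete minibatch fits.

```python
def minibatch_slices(data, batch_size):
    """Return consecutive full minibatches sliced from preloaded data."""
    batches = []
    batch_count = 0
    # Continue only while another complete minibatch fits in the data.
    while (batch_count + 1) * batch_size <= len(data):
        start = batch_count * batch_size
        batches.append(data[start:start + batch_size])
        batch_count += 1
    return batches


data = list(range(10))                  # hypothetical preloaded dataset
batches = minibatch_slices(data, 4)
```

Here the last two elements are left out because they do not form a complete minibatch.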
This "answer" is an implementation of muskrat's tf.slice suggestion with the details of tf.while_loop worked out (with help from How to use tf.while_loop() in tensorflow and https://www.tensorflow.org/api_docs/python/tf/while_loop).
Unless your data and model are small enough that you're bottlenecked by Python I/O (like me!), this solution is probably academic.
Advantages:
Trains over minibatches without returning to the Python thread.
Uses only ops that have GPU implementations meaning that the entire graph can be placed in the GPU.
On my small dataset, which is presumably bottlenecked by Python I/O, this solution is twice the speed of my dataset/iterator (which touches Python once per minibatch) and four times the speed of passing minibatches through feed_dict.
Disadvantages:
tf.while_loop is treacherous. It's challenging to understand when ops inside the loop's body are evaluated and when those they depend on are evaluated, particularly given the (thin) official documentation and limited Stack Overflow coverage.
The missing documentation of tf.while_loop is that tensors outside the body of the loop are only evaluated once, even if inner ops depend on them. This means that optimization, model, and loss have to be defined in the loop. This limits flexibility if you'd like to e.g. be able to call validation loss ops between training epochs. Presumably this could be accomplished with tf.cond statements and the appropriate flags passed in via feed_dict. But not nearly as flexible or elegant as the dataset/iterator mechanism in tf.data.
Adding shuffling operations at each Epoch doesn't seem available on GPU.
Here's my schematic code (I've omitted the variable and model definition for brevity):
def buildModel(info, training_data, training_targets):
    graph = tf.Graph()
    with graph.as_default():
        # batch_size is passed in from Python once per epoch.
        batch_size = tf.placeholder(tf.float32, name='batch_size')
        # Initializers for loop variables for tf.while_loop
        minibatchCounter = tf.Variable(0, dtype=tf.float32, trainable=False)
        lossList = tf.Variable(tf.zeros([0, 1]), trainable=False)
        # In a full example, I'd normalize my data here. And possibly shuffle
        tf_training_data = tf.constant(training_data, dtype=tf.float32)
        tf_training_targets = tf.constant(training_targets, dtype=tf.float32)
        # For brevity, I'll spare the definitions of my variables. Because tf.Variables
        # are essentially treated as globals in the model and are manipulated directly (like with tf.apply)
        # they can reside outside runMinibatch, the body of tf.while_loop.
        # weights_1 =
        # biases_1 =
        # etc.
        def moreMinibatches(batchCount, lossList):
            return (batchCount + 1) * batch_size <= len(training_data)
        def runMinibatch(batchCount, lossList):
            # These tensors and ops have to be defined inside runMinibatch, otherwise they're not updated as tf.while_loop loops. This means
            # slices, model definition, loss tensor, and training op.
            # The slices use the loop variable batchCount so that they advance on every iteration.
            dat_batch = tf.slice(tf_training_data, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            targ_batch = tf.slice(tf_training_targets, [tf.cast(batchCount * batch_size, tf.int32), 0], [tf.cast(batch_size, tf.int32), -1])
            # Here's where you'd define the model as a function of weights and biases above and dat_batch
            # model = <insert here>
            loss = tf.reduce_mean(tf.squared_difference(model, targ_batch))
            optimizer = tf.train.AdagradOptimizer(learning_rate=0.01) # for example; the learning rate is an example value
            train_op = optimizer.minimize(loss, name='optimizer')
            # control_dependencies ensures that train_op is run before return
            # even though the return values don't explicitly depend on it.
            with tf.control_dependencies([train_op]):
                return batchCount + 1, tf.concat([lossList, [[loss]]], 0)
        # So, the idea is that this trains a full epoch without returning to Python.
        trainAllMinibatches = tf.while_loop(moreMinibatches, runMinibatch, [minibatchCounter, lossList],
                                            shape_invariants=[minibatchCounter.get_shape(), tf.TensorShape(None)])
    return (graph,
            {'trainAllMinibatches' : trainAllMinibatches,
             'minibatchCounter' : minibatchCounter,
             'norm_loss' : norm_loss,
             } )
numEpochs = 100 # e.g.
minibatchSize = 32 #
# training_dataset = <data here>
# training_targets = <targets here>
graph, ops = buildModel(info, training_dataset, training_targets)
with tf.Session(graph=graph, config=config) as session:
    tf.global_variables_initializer().run()
    for i in range(numEpochs):
        # This op will train on all minibatches that fit in the full dataset. finalBatchCount will be the number of
        # complete minibatches in the dataset. lossList is a list of each minibatch's loss.
        finalBatchCount, lossList = session.run(ops['trainAllMinibatches'],
                                                feed_dict={'batch_size:0': minibatchSize})
        print('minibatch losses at Epoch', i, ': ', lossList)
I implemented tf.slice() and tf.while_loop approach to vectorize mini-batch suggested above.
It was about 1.86 times faster in my case than feeding mini-batches through feed_dict, but I found that the loss values did not stabilize from epoch to epoch.
Then I changed the code to tf.random_shuffle the inputs every epoch, and the problem was much mitigated. (The performance gain was reduced to 1.68 times.)

Tensorflow input pipeline

I have an input pipeline where samples are generated on the fly. I use Keras with a custom ImageDataGenerator and a corresponding Iterator to get samples into memory.
Under the assumption that Keras in my setup is using feed_dict (and that assumption is itself a question for me), I am thinking of speeding things up by switching to raw TensorFlow + Dataset.from_generator().
Here I see that the suggested solution for input pipelines that generate data on the fly in the most recent TensorFlow is to use Dataset.from_generator().
Questions:
Does keras with Tensorflow backend use feed_dict method?
If I switch to raw tensorflow + Dataset.from_generator(my_sample_generator) will that cut feed_dict memory copy overhead and buy me performance?
During predict (evaluation) phase apart from batch_x, batch_y I have also opaque index vector from my generator output. That vector corresponds to sample ids in the batch_x. Does that mean that I'm stuck with feed_dict approach for predict phase because I need that extra batch_z output from iterator?
The new tf.contrib.data.Dataset.from_generator() can potentially speed up your input pipeline by overlapping the data preparation with training. However, you will tend to get the best performance by switching over to TensorFlow ops in your input pipeline wherever possible.
To answer your specific questions:
The Keras TensorFlow backend uses tf.placeholder() to represent compiled function inputs, and feed_dict to pass arguments to a function.
With the recent optimizations to tf.py_func() and feed_dict copy overhead, I suspect the amount of time spent in memcpy() will be the same. However, you can more easily use Dataset.from_generator() with Dataset.prefetch() to overlap the training on one batch with preprocessing on the next batch.
It sounds like you can define a separate iterator for the prediction phase. The tf.estimator.Estimator class does something similar by instantiating different "input functions" with different signatures for training and evaluation, then building a separate graph for each role.
Alternatively, you could add a dummy output to your training iterator (for the batch_z values) and switch between training and evaluation iterators using a "feedable iterator".
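The dummy-output idea can be sketched with two plain-Python generators that yield the same three-part structure, so that both could be wrapped the same way (for example by Dataset.from_generator() with identical output types, which is an assumption here and not shown). The function names, the placeholder id value of -1 and the data are hypothetical.

```python
def train_batches(xs, ys, batch_size):
    """Yield (batch_x, batch_y, batch_z) with a placeholder batch_z."""
    for i in range(0, len(xs), batch_size):
        bx, by = xs[i:i + batch_size], ys[i:i + batch_size]
        yield bx, by, [-1] * len(bx)       # dummy ids, ignored during training


def predict_batches(xs, ys, ids, batch_size):
    """Yield (batch_x, batch_y, batch_z) with the real sample ids."""
    for i in range(0, len(xs), batch_size):
        yield xs[i:i + batch_size], ys[i:i + batch_size], ids[i:i + batch_size]


xs, ys, ids = [10, 11, 12], [0, 1, 0], [7, 8, 9]    # hypothetical data
train_first = next(train_batches(xs, ys, 2))
predict_first = next(predict_batches(xs, ys, ids, 2))
```

Because both generators produce elements of the same shape, code that consumes them can switch between training and prediction without changing its signature.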