Distributed Tensorflow Using fit_generator() - tensorflow

Is it possible to use Tensorflow in a distributed manner and use the fit_generator()? In my research so far I have not seen anything on how to do this or if it is possible. If it is not possible then what are some possible solutions to use distributed Tensorflow when all the data will not fit in memory.

Using fit_generator() is not possible under a tensorflow distribution scope.
have a lookt at tf.data. i rewrote all my Keras ImageDataGenerators to a tensorflow data pipeline. doesn't need much time, is more transparent and quite remarkably faster.

Related

Tensorflow profiling for non-model computations

I have a computation which has for loops and calls to Tensorflow matrix algorithms such as tf.lstsq and Tensorflow iteration with tf.map_fn. I would like to profile this to see how much parallelism I am getting in the tf.map_fn and matrix algorithms that get called.
This doesn't seem to be the use case at all for the Tensorflow Profiler which is organized around the neural network model training loop.
Is there a way to use Tensorflow Profiler for arbitrary Tensorflow computations, or is the go-to move in this case to use NVidia tools like nvprof?
I figured out that the nvprof and nvvp and nsight tools I was looking for are available as a Conda install of cudatoolkit-dev. Uses are described in this gist.

How to do parallel GPU inferencing in Tensorflow 2.0 + Keras?

Let's begin with the premise that I'm newly approaching to TensorFlow and deep learning in general.
I have TF 2.0 Keras-style model trained using tf.Model.train(), two available GPUs and I'm looking to scale down inference times.
I trained the model distributing across GPUs using the extremely handy tf.distribute.MirroredStrategy().scope() context manager
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model.compile(...)
model.train(...)
both GPUs get effectively used (even if I'm not quite happy with the results accuracy).
I can't seem to find a similar strategy for distributing inference between GPUs with the tf.Model.predict() method: when i run model.predict() I get (obviously) usage from only one of the two GPUs.
Is it possible to istantiate the same model on both GPUs and feed them different chunks of data in parallel?
There are posts that suggest how to do it in TF 1.x but I can't seem to replicate the results in TF2.0
https://medium.com/#sbp3624/tensorflow-multi-gpu-for-inferencing-test-time-58e952a2ed95
Tensorflow: simultaneous prediction on GPU and CPU
my mental struggles with the question are mainly
TF 1.x is tf.Session()based while sessions are implicit in TF2.0, if I get it correctly, the solutions I read use separate sessions for each GPU and I don't really know how to replicate it in TF2.0
I don't know how to use the model.predict() method with a specific session.
I know that the question is probably not well-formulated but I summarize it as:
Does anybody have a clue on how to run Keras-style model.predict() on multiple GPUs (inferencing on a different batch of data on each GPU in a parallel way) in TF2.0?
Thanks in advance for any help.
Try to load model in tf.distribute.MirroredStrategy and use greater batch_size
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model = tf.keras.models.load_model(saved_model_path)
result = model.predict(batch_size=greater_batch_size)
There still does not seem to be an official example for distributed inference. There is a potential solution here using tf.distribute.MirroredStrategy: https://github.com/tensorflow/tensorflow/issues/37686. However, it does not seem to fully utilize multi gpus

fast.ai equivalent in tensorflow

Is there any equivalent/alternate library to fastai in tensorfow for easier training and debugging deep learning models including analysis on results of trained model in Tensorflow.
Fastai is built on top of pytorch looking for similar one in tensorflow.
The obvious choice would be to use tf.keras.
It is bundled with tensorflow and is becoming its official "high-level" API -- to the point where in TF 2 you would probably need to go out of your way not using it at all.
It is clearly the source of inspiration for fastai to easy the use of pytorch as Keras does for tensorflow, as mentionned by the authors time and again:
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.

What exactly is "Tensorflow Distibuted", now that we have Tensorflow Serving?

I don't understand why "Tensorflow Distributed" still exists, now that we have Tensorflow Serving. It seems to be some way to use core Tensorflow as a serving platform, but why would we want that when Tensorflow Serving and TFX is a much more robust platform? Is it just legacy support? If so, then the Tensorflow Distributed pages should make that clear and point people towards TFX.
Distributed Tensorflow can support training one model in many machines by implementing a parameter server, with either data parallelism or model parallelism.

Seq2Seq use of buckets in TensorFlow tutorial

I am wondering why the buckets are being introduced in the Seq2Seq TensorFlow tutorial. I understand the efficiency gain from not processing the padding symbols, but you can avoid processing the paddings if you use rnn and specify the sequence_length parameter. Or if you use dynamic_rnn.
Is it because it helps distributing the training across multiple devices / machines ?
One reason is that seq2seq was created before dynamic rnn was available. The other is that, even with dynamic rnn, it still helps for speed if your batches are organized by bucket.