How to use TF 2.0 tf.recompute_grad? - tensorflow

I wanted to use memory-saving gradients (openai/gradient-checkpointing) to reduce the GPU memory cost of my neural network, but I found that this is not possible in TF 2.0. I did find that tf.recompute_grad can be used for the same purpose, but I couldn't find any examples or tutorials on Google, so I am asking here. Also, is it possible to use this with tf.keras?
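A minimal sketch of how this might look (layer sizes are illustrative, not from the question): tf.recompute_grad wraps a function so its intermediate activations are recomputed during the backward pass instead of being stored, trading extra compute for lower GPU memory use, and the wrapper can live inside a custom tf.keras layer.

```python
import tensorflow as tf

class RecomputedBlock(tf.keras.layers.Layer):
    """Sketch: a block whose activations are recomputed during backprop."""

    def __init__(self, units, **kwargs):
        super().__init__(**kwargs)
        self.inner = tf.keras.Sequential([
            tf.keras.layers.Dense(units, activation="relu"),
            tf.keras.layers.Dense(units, activation="relu"),
        ])

    def build(self, input_shape):
        # Create the variables up front so the recomputed function does not
        # try to create variables during the backward-pass re-execution.
        self.inner.build(input_shape)
        super().build(input_shape)

    def call(self, inputs):
        # Gradients through this layer re-run the forward computation
        # instead of keeping the intermediate activations in memory.
        return tf.recompute_grad(self.inner)(inputs)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(256,)),
    RecomputedBlock(1024),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```

The memory saving only pays off for blocks that are deep or wide enough that storing their activations dominates GPU memory; wrapping tiny layers mostly adds recompute overhead.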

Related

What is the difference between JAX, Trax, and TensorRT, in simple terms?

I have been using TensorRT and TensorFlow-TRT to accelerate the inference of my DL algorithms.
Then I have heard of:
JAX https://github.com/google/jax
Trax https://github.com/google/trax
Both seem to accelerate DL. But I am having a hard time understanding them. Can anyone explain them in simple terms?
Trax is a deep learning framework created by Google and extensively used by the Google Brain team. It comes as an alternative to TensorFlow and PyTorch for implementing off-the-shelf state-of-the-art deep learning models, for example Transformers, BERT, etc., principally in the field of Natural Language Processing.
Trax is built on top of TensorFlow and JAX. JAX is an enhanced and optimised version of NumPy. The important distinction between JAX and NumPy is that the former uses a library called XLA (Accelerated Linear Algebra), which lets your NumPy-style code run on GPU and TPU rather than only on CPU as with plain NumPy, thus speeding up computation.
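As a small illustrative sketch (the function and shapes are made up), jax.numpy mirrors the NumPy API and jax.jit compiles the function with XLA so it can run on an accelerator:

```python
import jax
import jax.numpy as jnp

@jax.jit  # compile with XLA; runs on GPU/TPU if one is available
def predict(w, x):
    return jnp.tanh(jnp.dot(x, w))

w = jnp.ones((3, 2))
x = jnp.ones((4, 3))
print(predict(w, x))  # same code as NumPy, but JIT-compiled
```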

How to get the exact GPU memory usage for Keras

I recently started learning Keras and TensorFlow. I am currently testing out a few models on the MNIST dataset (pretty basic stuff). I wanted to know exactly how much memory my model consumes during training and inference. I tried googling but did not find much info.
I came across nvidia-smi. I tried the config.gpu_options.allow_growth = True option, but I am still not able to determine the exact memory python.exe is consuming, due to some issues with nvidia-smi. I know that I could run separate passes of training and inference, but this is too cumbersome. It would be very easy if I could just find the right API to do the job.
TensorFlow being such a well-known and widely used library, I am hoping there is a better and faster way to get these numbers.
Finally, once again my question is:
How to get the exact memory usage for a Keras model during training and inference.
Relevant specs:
OS: Windows 10
GPU: GTX 1050
TensorFlow version: 1.14
Please let me know if any other details are required.
Thanks!
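For reference, a minimal sketch (TF 1.x style, matching the question's version; whether this is "exact" enough is for the asker to judge) of the allow_growth setting mentioned above, together with the contrib op that reports peak device memory:

```python
import tensorflow as tf

# allow_growth lets the process claim GPU memory on demand instead of
# grabbing it all at start-up, so the reported usage tracks actual need.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# TF 1.x contrib op that reports the peak bytes allocated on the device
# (newer TF 2.x versions offer tf.config.experimental.get_memory_info).
with tf.device('/gpu:0'):
    peak_bytes = tf.contrib.memory_stats.MaxBytesInUse()

# Run training/inference first, then evaluate the op to read the peak.
print(sess.run(peak_bytes))
```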

Distributed Tensorflow Using fit_generator()

Is it possible to use TensorFlow in a distributed manner together with fit_generator()? In my research so far I have not seen anything on how to do this or whether it is possible. If it is not possible, what are some possible ways to use distributed TensorFlow when all the data will not fit in memory?
Using fit_generator() is not possible under a tensorflow distribution scope.
Have a look at tf.data. I rewrote all my Keras ImageDataGenerators as tf.data pipelines. It doesn't take much time, is more transparent, and is remarkably faster.
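A minimal sketch of that approach (file paths, labels, and image size are placeholders): a tf.data pipeline streams images from disk, so the full dataset never has to fit in memory, and the resulting dataset can be passed straight to model.fit() under a distribution strategy.

```python
import tensorflow as tf

def load_image(path, label):
    # Decode and resize one image; runs inside the tf.data pipeline.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, (224, 224)) / 255.0
    return image, label

paths = ["img_0.jpg", "img_1.jpg"]   # placeholder file list
labels = [0, 1]                      # placeholder labels

dataset = (tf.data.Dataset.from_tensor_slices((paths, labels))
           .shuffle(buffer_size=1000)
           .map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# The dataset is sharded/distributed by the strategy; no fit_generator needed.
model.fit(dataset, epochs=1)
```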

fast.ai equivalent in tensorflow

Is there any equivalent/alternative library to fastai in TensorFlow for easier training and debugging of deep learning models, including analysis of the results of a trained model?
Fastai is built on top of PyTorch; I am looking for a similar library in TensorFlow.
The obvious choice would be to use tf.keras.
It is bundled with TensorFlow and is becoming its official "high-level" API -- to the point where in TF 2 you would probably need to go out of your way not to use it at all.
Keras is clearly the source of inspiration for fastai, which aims to ease the use of PyTorch just as Keras does for TensorFlow, as mentioned by the authors time and again:
Unfortunately, Pytorch was a long way from being a good option for part one of the course, which is designed to be accessible to people with no machine learning background. It did not have anything like the clear simple API of Keras for training models. Every project required dozens of lines of code just to implement the basics of training a neural network. Unlike Keras, where the defaults are thoughtfully chosen to be as useful as possible, Pytorch required everything to be specified in detail. However, we also realised that Keras could be even better. We noticed that we kept on making the same mistakes in Keras, such as failing to shuffle our data when we needed to, or vice versa. Also, many recent best practices were not being incorporated into Keras, particularly in the rapidly developing field of natural language processing. We wondered if we could build something that could be even better than Keras for rapidly training world-class deep learning models.
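To illustrate the "thoughtfully chosen defaults" point, here is a minimal sketch of the tf.keras workflow the answer recommends: a few lines to load data, define, compile, and train a model.

```python
import tensorflow as tf

# Standard MNIST example: defaults do most of the work.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
```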

Why is MobileNetV2 faster than MobileNetV1 only on mobile devices?

I am studying Google's brand-new MobileNetV2 architecture.
While studying, I read this line in the TensorFlow model zoo on GitHub:
'For example Mobilenet V2 is faster on mobile devices than Mobilenet V1, but is slightly slower on desktop GPU.'
So, my question is,
How could that be possible? I really want to know why.
From https://arxiv.org/abs/1903.08469v1 :
"However, MobileNet V2 uses depthwise separable convolutions which are not directly supported in GPU firmware (the cuDNN library). Therefore, MobileNet V2 tends to be slower than ResNet18 in most experimental setups. Note that the same issue disqualifies usage of the DenseNet architecture [12], since it requires efficient convolution over a non-contiguous tensor, which is still not supported in cuDNN."
From their published paper, MobileNetV2: Inverted Residuals and Linear Bottlenecks, under Section 5 (Implementation Notes), 5.1 Memory efficient inference:
"The inverted residual bottleneck layers allow a particularly memory efficient implementation which is very important for mobile applications." (and more in the paper)
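For concreteness, a minimal sketch (filter counts and expansion factor are illustrative) of the inverted residual bottleneck the paper describes: expand with a 1x1 convolution, filter with a depthwise 3x3 convolution, then project back down with a linear 1x1 convolution.

```python
import tensorflow as tf

def inverted_residual(x, expansion=6, filters=32, stride=1):
    in_channels = x.shape[-1]
    # Expand: 1x1 conv to a wider intermediate representation.
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)
    # Depthwise 3x3 conv: the part cuDNN handles less efficiently on desktop GPUs.
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same",
                                        use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)
    # Project: linear 1x1 conv back to the bottleneck width (no activation).
    h = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    if stride == 1 and in_channels == filters:
        h = tf.keras.layers.Add()([x, h])  # residual connection
    return h

inputs = tf.keras.Input(shape=(224, 224, 32))
outputs = inverted_residual(inputs, filters=32)
model = tf.keras.Model(inputs, outputs)
```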
According to the TensorFlow team, it is optimized to be smaller in size and can also be used with TF Lite, which is indeed intended for mobile use. It is slower on a desktop GPU probably because V2 has more conv layers than V1, which would also make sense if training takes longer to finish. For now, training and inference are usually not done on mobile because they are so computationally hungry, which leads to power hunger as well.
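As a quick sketch (weights left uninitialised to keep it lightweight), you can compare the two architectures from tf.keras.applications and convert V2 for TF Lite deployment:

```python
import tensorflow as tf

# Compare model sizes: V2 has more layers but fewer parameters than V1.
v1 = tf.keras.applications.MobileNet(weights=None)
v2 = tf.keras.applications.MobileNetV2(weights=None)
print("V1 params:", v1.count_params())
print("V2 params:", v2.count_params())

# Convert V2 to TF Lite for mobile inference.
converter = tf.lite.TFLiteConverter.from_keras_model(v2)
tflite_model = converter.convert()
with open("mobilenet_v2.tflite", "wb") as f:
    f.write(tflite_model)
```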
Hope this answers the question.