I have a computation which has for loops and calls to Tensorflow matrix algorithms such as tf.lstsq and Tensorflow iteration with tf.map_fn. I would like to profile this to see how much parallelism I am getting in the tf.map_fn and matrix algorithms that get called.
This doesn't seem to be the use case at all for the Tensorflow Profiler which is organized around the neural network model training loop.
Is there a way to use Tensorflow Profiler for arbitrary Tensorflow computations, or is the go-to move in this case to use NVidia tools like nvprof?
I figured out that the nvprof and nvvp and nsight tools I was looking for are available as a Conda install of cudatoolkit-dev. Uses are described in this gist.
Related
Okay, so I've worked on a bunch of Deep Learning projects and internships now and I've never had to do heavy training. But lately I've been thinking of doing some Transfer Learning for which I'll need to run my code on a GPU. Now I have a system with Windows 10 and a dedicated NVIDIA GeForce 940M GPU. I've been doing a lot of research online, but I'm still a bit confused. I haven't installed the NVIDIA Cuda Toolkit or cuDNN or tensorflow-gpu on my system yet. I currently use tensorflow and pytorch to train my DL models. Here are my queries -
When I define a tensor in tf or pytorch, it is a cpu tensor by default. So, all the training I've been doing so far has been on the CPU. So, if I make sure to install the correct versions of Cuda and cuDNN and tensorflow-gpu (specifically for tensorflow), I can run my models on my GPU using tf-gpu and pytorch and that's it? (I'm aware of the torch.cuda.is_available() in pytorch to ensure pytorch can access my GPU and the device_lib module in tf to check if my gpu is visible to tensorflow)(I'm also aware of the fact that tf doesnt support all Nvidia GPUs)
Why does tf have a separate module for GPU support? PyTorch doesnt seem to have that and all you need to do is cast your tensor from cpu() to cuda() to switch between them.
Why install cuDNN? I know it is a high-level API CUDA built for support to train Deep Neural Nets on the GPU. But do tf-gpu and torch use these in the backend while training on the gpu?
After tf == 1.15, did they combine CPU and GPU support all into one package?
First of all unfortunately 940M is a kinda weak GPU for training. I suggest you use Google colab for faster training but of course, it would be faster than the CPU. So here my answers to your four questions.
1-) Yes if you install the requirements correctly, then you can run on GPU. You can manually place your data to your GPU as well. You can check implementations on TensorFlow. In PyTorch, you should specify the device that you want to use. As you said you should do device = torch.device("cuda" if args.cuda else "cpu") then for models and data you should always call .to(device) Then it will automatically use GPU if available.
2-) PyTorch also needs extra installation (module) for GPU support. However, with recent updates both TF and PyTorch are easy to use for GPU compatible code.
3-) Both Tensorflow and PyTorch is based on cuDNN. You can use them without cuDNN but as far as I know, it hurts the performance but I'm not sure about this topic.
4-) No they are still different packages. tensorflow-gpu==1.15 and tensorflow==1.15 what they did with tf2, was making the tensorflow more like Keras. So it is more simplified then 1.15 or before.
Rest was already answered by regarding 3) cudNN optimizes layer and such operations on hardware level and those implementations are pure black magic. It is incredibly hard to write CUDA code that properly utilizes your GPU (how load data into the GPU, how to actually perform them using matrices etc. )
I have been using TensorRT and TensorFlow-TRT to accelerate the inference of my DL algorithms.
Then I have heard of:
JAX https://github.com/google/jax
Trax https://github.com/google/trax
Both seem to accelerate DL. But I am having a hard time to understand them. Can anyone explain them in simple terms?
Trax is a deep learning framework created by Google and extensively used by the Google Brain team. It comes as an alternative to TensorFlow and PyTorch when it comes to implementing off-the-shelf state of the art deep learning models, for example Transformers, Bert etc. , in principle with respect to the Natural Language Processing field.
Trax is built upon TensorFlow and JAX. JAX is an enhanced and optimised version of Numpy. The important distinction about JAX and NumPy is that the former using a library called XLA (advanced linear algebra) which allows to run your NumPy code on GPU and TPU rather than on CPU like it happens in the plain NumPy, thus speeding up computation.
Let's begin with the premise that I'm newly approaching to TensorFlow and deep learning in general.
I have TF 2.0 Keras-style model trained using tf.Model.train(), two available GPUs and I'm looking to scale down inference times.
I trained the model distributing across GPUs using the extremely handy tf.distribute.MirroredStrategy().scope() context manager
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model.compile(...)
model.train(...)
both GPUs get effectively used (even if I'm not quite happy with the results accuracy).
I can't seem to find a similar strategy for distributing inference between GPUs with the tf.Model.predict() method: when i run model.predict() I get (obviously) usage from only one of the two GPUs.
Is it possible to istantiate the same model on both GPUs and feed them different chunks of data in parallel?
There are posts that suggest how to do it in TF 1.x but I can't seem to replicate the results in TF2.0
https://medium.com/#sbp3624/tensorflow-multi-gpu-for-inferencing-test-time-58e952a2ed95
Tensorflow: simultaneous prediction on GPU and CPU
my mental struggles with the question are mainly
TF 1.x is tf.Session()based while sessions are implicit in TF2.0, if I get it correctly, the solutions I read use separate sessions for each GPU and I don't really know how to replicate it in TF2.0
I don't know how to use the model.predict() method with a specific session.
I know that the question is probably not well-formulated but I summarize it as:
Does anybody have a clue on how to run Keras-style model.predict() on multiple GPUs (inferencing on a different batch of data on each GPU in a parallel way) in TF2.0?
Thanks in advance for any help.
Try to load model in tf.distribute.MirroredStrategy and use greater batch_size
mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
model = tf.keras.models.load_model(saved_model_path)
result = model.predict(batch_size=greater_batch_size)
There still does not seem to be an official example for distributed inference. There is a potential solution here using tf.distribute.MirroredStrategy: https://github.com/tensorflow/tensorflow/issues/37686. However, it does not seem to fully utilize multi gpus
Is it possible to use Tensorflow in a distributed manner and use the fit_generator()? In my research so far I have not seen anything on how to do this or if it is possible. If it is not possible then what are some possible solutions to use distributed Tensorflow when all the data will not fit in memory.
Using fit_generator() is not possible under a tensorflow distribution scope.
have a lookt at tf.data. i rewrote all my Keras ImageDataGenerators to a tensorflow data pipeline. doesn't need much time, is more transparent and quite remarkably faster.
I am making a neural network using tensorflow and I ran into a problem trying to use a generator to split my data up, basically it's too slow.
My training data consists of 52x52 numpy arrays. I need to split each array into a 52x52x3 array before I input it into my NN. As mentioned I have a generator working that does this, but I noticed that even though my NN is running on the GPU my GPU usage is very low (under 10% usually). I think this might be caused by me doing the generator on the CPU.
Is there any way of running my generator on the GPU?
What I tried:
- I thought of trying to use pyCUDA in order to program the generator on the GPU but found that tensorflow and pyCUDA don't support each other
-I tried using the from_generator function from the Dataset API as mentioned here:
https://www.tensorflow.org/api_docs/python/tf/contrib/data/Dataset
But while having issues with it I ran into this github thread mentioning that this function isn't supported to run on the GPU anyway:
https://github.com/tensorflow/tensorflow/issues/13610
Any help would be greatly appreciated.