Is there a way (or maybe a list?) to determine all the TensorFlow operations that offer GPU support?
Right now, it is a trial-and-error process for me: I try to place a group of operations on the GPU. If it works, sweet. If not, I try somewhere else.
This is the only thing relevant (but not helpful) I have found so far: https://github.com/tensorflow/tensorflow/issues/2502
Unfortunately, it seems there's no built-in way to get an exact list or even check this for a specific op. As mentioned in the issue above, the dispatching is done in the native C++ code: a particular operation can be assigned to GPU if a corresponding kernel has been registered to DEVICE_GPU.
I think the easiest way is to grep the TensorFlow source tree for kernel registrations, e.g. grep -r "REGISTER_KERNEL_BUILDER" tensorflow/, to get a list of registered kernels, which will look something like this.
But remember that even with a REGISTER_KERNEL_BUILDER registration, there's no guarantee an op will actually run on a GPU. For example, 32-bit integer Add is placed on the CPU regardless of the existing GPU kernel.
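If you just want to see where a specific op actually ends up at runtime, an indirect but practical check is to enable device placement logging. A minimal sketch, assuming TF 1.x graph mode and a GPU-enabled build:

    import tensorflow as tf

    # Log where each op is actually placed when the graph runs.
    a = tf.constant([1, 2, 3], dtype=tf.int32)
    b = tf.constant([4, 5, 6], dtype=tf.int32)
    c = a + b  # int32 Add is typically placed on the CPU even though a kernel is registered

    with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
        print(sess.run(c))  # the assigned device for every op is printed to the log

The log shows the assigned device for each node, so you can tell op by op whether the GPU kernel was actually used.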
I am getting started on using tensorflow-quantum for some QML circuit simulations. I have everything configured correctly for TensorFlow with GPU, and when I run print(tf.config.list_physical_devices('GPU')), it reports the presence of my GPU.
I've done some Googling and come across a few things suggesting that tensorflow-quantum doesn't actually support GPU acceleration for simulations (e.g. MichaelBroughton's first reply here, and this issue, which is still open). However, it's unclear to me how up to date this state of affairs is. I can't find anything about adding GPU support in the release notes.
Does tensorflow-quantum currently support GPU? If so, how do I (a) make it use my GPU for simulations and (b) verify that it is doing so?
I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines to the gres.conf file that specify the same device.
However, it seems the Slurm daemon doesn't accept this; the service returns:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of similar to the question below, but that one seems specific to some CUDA code with compilation enabled, which seems way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
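If you go the second route and the calculations are TensorFlow jobs, one way to let them coexist on the single GPU the job owns is to cap how much GPU memory each worker process grabs. A minimal sketch, assuming a recent TF 2.x release; the 2048 MB limit is a placeholder you would tune to your workload:

    import tensorflow as tf

    # Inside one Slurm job that owns the GPU: cap this process's GPU memory
    # so several worker processes can share the device side by side.
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=2048)])  # MB, per process

    # ... build and train as usual; each parallel worker does the same.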
Besides NVIDIA MPS, mentioned by Marcus Boden, which is relevant for V100-type cards, there is also Multi-Instance GPU (MIG), which is relevant for A100-type cards.
I am running some code on an embedded system with an extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to ensure I am being as efficient as possible in my implementation/optimization?
Some ideas:
- Pruning: https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big-O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports leveraging a GPU and/or hardware accelerators, if any are available to you.
https://www.tensorflow.org/lite
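For instance, converting a trained Keras model with the default optimizations (which enables post-training quantization) only takes a few lines. A minimal sketch, assuming TF 2.x; the model file names are placeholders:

    import tensorflow as tf

    # Convert a trained Keras model to a smaller, quantized TensorFlow Lite model.
    model = tf.keras.models.load_model('my_model.h5')  # hypothetical path

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
    tflite_model = converter.convert()

    with open('my_model.tflite', 'wb') as f:
        f.write(tflite_model)

The resulting .tflite file is what you deploy to the embedded device, together with the TensorFlow Lite interpreter.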
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is pruning functionality in the tf.contrib submodule. It exposes a sparsity parameter that you can tune to determine the size of the network.
I suggest you take a look at the whole tf.contrib.model_pruning submodule here. It has plenty of functions you might need for your specific task.
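As a rough idea of how it is used, here is a minimal sketch for TF 1.13; the layer sizes and the 0.5 target sparsity are placeholder values, and the exact hparam names are worth double-checking against the submodule's README:

    import tensorflow as tf
    from tensorflow.contrib import model_pruning

    # Build layers with prunable weight masks, then let the pruning op
    # gradually zero out weights until the target sparsity is reached.
    x = tf.placeholder(tf.float32, [None, 784])
    net = model_pruning.masked_fully_connected(x, 256)
    logits = model_pruning.masked_fully_connected(net, 10, activation_fn=None)

    global_step = tf.train.get_or_create_global_step()
    hparams = model_pruning.get_pruning_hparams().parse('target_sparsity=0.5')
    pruning_obj = model_pruning.Pruning(hparams, global_step=global_step)
    mask_update_op = pruning_obj.conditional_mask_update_op()
    # Run mask_update_op alongside your training op inside the session loop.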
As far as I can see, TensorFlow is designed to have fully static device placement across a single tf.Session.run(). Is there a known ideal location to insert code for on-the-fly changing of operation device placement?
I'm aware of the static methods at a python level, but I'm looking for something at a C++ level such that I can do something akin to load balancing.
As an example, let's say I want TensorFlow to schedule operations to the CPU and GPU in an alternating fashion (hardly ideal, I know). How might I do this at runtime, so that as operation dependencies are resolved and more operations are scheduled, the device assigned to an operation can be updated? Would this best be done by using the DeviceMgr to change the execution device for a given operation's environment in ExecutorState::Process(TaggedNode tagged_node, int64 scheduled_usec), right before the operation is launched (line 1651 of executor.cc)? Or am I misunderstanding when an operation is scheduled for execution through XLA, and what the latest point is at which I can change the device placement?
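For reference, the static Python-level placement I mean is the usual tf.device scoping, where the device is fixed at graph-construction time rather than at run time. A minimal sketch, assuming TF 1.x graph mode:

    import tensorflow as tf

    # Static placement: devices are chosen when the graph is built and
    # cannot be changed between (or during) Session.run() calls.
    with tf.device('/cpu:0'):
        a = tf.random_normal([1024, 1024])
    with tf.device('/gpu:0'):
        b = tf.matmul(a, a)

    with tf.Session() as sess:
        sess.run(b)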
I am somewhat familiar with the CUDA visual profiler and the occupancy spreadsheet, although I am probably not leveraging them as well as I could. Profiling & optimizing CUDA code is not like profiling & optimizing code that runs on a CPU. So I am hoping to learn from your experiences about how to get the most out of my code.
There was a post recently looking for the fastest possible code to identify self numbers, and I provided a CUDA implementation. I'm not satisfied that this code is as fast as it can be, but I'm at a loss to figure out both what the right questions are and which tool I can get the answers from.
How do you identify ways to make your CUDA kernels perform faster?
If you're developing on Linux, then the CUDA Visual Profiler gives you a whole load of information, but knowing what to do with it can be a little tricky. On Windows you can also use the CUDA Visual Profiler, or (on Vista/7/2008) you can use Nexus, which integrates nicely with Visual Studio and gives you combined host and GPU profile information.
Once you've got the data, you need to know how to interpret it. The Advanced CUDA C presentation from GTC has some useful tips. The main things to look out for are:
Optimal memory accesses: you need to know what you expect your code to do and then look for exceptions. So if you are always loading floats, and each thread loads a different float from an array, then you would expect to see only 64-byte loads (on current h/w), since a half-warp of 16 threads each loading a 4-byte float coalesces into a single 64-byte transaction. Any other loads are inefficient. The profiling information will probably improve in future h/w.
Minimise serialization: the "warp serialize" counter indicates that you have shared memory bank conflicts or constant-memory serialization; the presentation goes into more detail about what to do about this, as does the SDK (e.g. the reduction sample).
Overlap I/O and compute: this is where Nexus really shines (you can get the same info manually using CUDA events); if you have a large amount of data transfer, you want to overlap the compute with the I/O.
Execution configuration: the occupancy calculator can help with this, but simple methods like commenting out the compute to measure expected vs. measured bandwidth are really useful (and vice versa for compute throughput).
This is just a start, check out the GTC presentation and the other webinars on the NVIDIA website.
If you are using Windows... Check Nexus:
http://developer.nvidia.com/object/nexus.html
The CUDA profiler is rather crude and doesn't provide a lot of useful information. The only way to seriously micro-optimize your code (assuming you have already chosen the best possible algorithm) is to have a deep understanding of the GPU architecture, particularly with regard to using shared memory, external memory access patterns, register usage, thread occupancy, warps, etc.
Maybe you could post your kernel code here and get some feedback?
The NVIDIA CUDA developer forum is also a good place to go for help with this kind of problem.
I hung back because I'm no CUDA expert, and the other answers are pretty good IF the code is already pretty near optimal. In my experience, that's a big IF, and there's no harm in verifying it.
To verify it, you need to find out if the code is for sure not doing anything it doesn't really have to do. Here are ways I can see to verify that:
Run the same code on the vanilla processor, and either take stackshots of it, or use a profiler such as Oprofile or RotateRight/Zoom that can give you equivalent information.
Run it on the CUDA processor and do the same thing, if possible.
What you're looking for are lines of code that have high occupancy on the call stack, as shown by the fraction of stack samples containing them. Those are your "bottlenecks". It does not take a very large number of samples to locate them.