On an NVIDIA board with multiple GPUs (a K80, for example), why does torch.cuda.device_count() return 1?

I ran the following code on a Tesla K80, which, as I understand it, consists of 2 GK210 GPUs, each with 12 GB of on-board RAM, connected by something called a PLX switch. I am confused about how, at the PyTorch level, the fact that there are two GPUs is hidden from the user:
import torch
torch.cuda.device_count() # 1
(My hunch is that TensorFlow provides this same abstraction.)
Follow-up questions:
If I am training a model with PyTorch and I run nvidia-smi and see that the GPU is fully utilized, I would assume this means that both GK210s are at 100% utilization. How does PyTorch distribute kernels across the two GK210s, and can I have faith that this is being done efficiently (i.e., in a way that minimizes data transfer between the two cards)? Any resources that explain how this works would be much appreciated.
If I were writing a CUDA application, could I pin a CUDA stream to each card, and explicitly manage data transfers between the two cards?
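Regarding the second follow-up: if the driver exposes both GK210s as separate CUDA devices, you can address them individually from PyTorch, pin a stream to each one, and manage the transfers yourself. A minimal sketch under that assumption (tensor sizes and variable names are made up for illustration):

import torch

if torch.cuda.device_count() >= 2:
    # One stream pinned to each physical device.
    s0 = torch.cuda.Stream(device="cuda:0")
    s1 = torch.cuda.Stream(device="cuda:1")

    x = torch.randn(4096, 4096, device="cuda:0")

    with torch.cuda.device(0), torch.cuda.stream(s0):
        y = x @ x  # kernel enqueued on cuda:0 in stream s0

    # Explicit device-to-device copy (over the PLX switch / PCIe).
    y1 = y.to("cuda:1", non_blocking=True)

    with torch.cuda.device(1), torch.cuda.stream(s1):
        z = y1 * 2  # kernel enqueued on cuda:1 in stream s1

    torch.cuda.synchronize("cuda:0")
    torch.cuda.synchronize("cuda:1")

In a raw CUDA application, the equivalent would be cudaSetDevice plus per-device streams, with cudaMemcpyPeer / cudaMemcpyPeerAsync for explicit card-to-card transfers.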

Related

Running the same detection model on different GPUs

I recently ran into a bit of a glitch where my detection model, running on two different GPUs (a Quadro RTX 4000 and an RTX A4000) on two different systems, utilizes the GPU differently.
The model uses only 0.2% of the GPU on the Quadro system and anywhere from 50 to 70% on the A4000 machine. I am curious about why this is happening. The rest of the hardware on both machines is the same.
Additional information: The model uses a 3D convolution and is built on TensorFlow.
It looks like the Quadro RTX 4000 is not using the GPU at all.
The method tf.test.is_gpu_available() is deprecated and can return True even though the GPU is not actually being used.
The correct way to verify GPU availability is to check the output of:
tf.config.list_physical_devices('GPU')
On the Quadro machine you should also run (in terminal):
watch -n 1 nvidia-smi
to see, in real time, how much GPU memory is being used.
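To go one step further than listing devices, a small check like the sketch below (standard TF 2.x APIs; the matrix sizes are arbitrary) confirms that ops are actually being placed on the GPU:

import tensorflow as tf

# Does TensorFlow see any GPU at all?
print(tf.config.list_physical_devices('GPU'))

# Log where each op is placed; a GPU device string should appear
# in the output if the GPU build and drivers are working.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((1024, 1024))
b = tf.random.uniform((1024, 1024))
c = tf.matmul(a, b)
print(c.device)  # e.g. '/job:localhost/replica:0/task:0/device:GPU:0'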

Since TensorflowJS can use the GPU via WebGL, why would I need an nVIDIA GPU?

So TensorFlow.js can use WebGL to do GPU computations and train deep learning models. Why isn't this more popular than using CUDA with an NVIDIA GPU? Most people just trying to prototype machine learning models would love to do so on their personal computer, but many of us resort to using expensive cloud services like AWS (although more recently Google Colab helps) for ML training if we don't have a computer with an NVIDIA GPU. I'm sure NVIDIA GPUs are faster than whatever GPU is in my MacBook, but probably any GPU will offer at least an order of magnitude speedup over even a fast CPU and allow for model prototyping, so why aren't we all using WebGL GPGPU? There must be a catch I just don't know about.
The WebGL backend uses the GLSL language to define functions and uploads data as shaders - it "works", but you pay a huge cost to compile the GLSL and upload the shaders: warmup time for semi-complex models is immense (we're talking about minutes just to start up). And then the memory overhead is 100-200% of what the model would normally need - and for larger models you're GPU-memory bound, so you don't want to waste that.
By the way, actual inference time once the model is warmed up and fits in memory is OK using WebGL.
On the other hand, NVIDIA's CUDA libraries provide direct access to the GPU, so TF compiled to use them is always going to be much more efficient.
Unfortunately, not many GPU vendors provide libraries like CUDA, so most ML is done on NVIDIA GPUs.
Then there is the next level, when you're using a TPU instead of a GPU - then there is no WebGL to start with.
If I select WebGPU with the TFJS benchmark (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) it responds with "WebGPU is not supported. Please use Chrome Canary browser with flag "--enable-unsafe-webgpu" enabled...."
So when that's ready, will it be competitive with CUDA? On my laptop it is about 15% faster than WebGL on that benchmark.

CUDA programming: Is occupancy the way to achieve GPU slicing among different processes?

There are ways through which GPU sharing can be achieved. I came across occupancy. Can I use it to slice the GPU among the processes (e.g. TensorFlow) which are sharing the GPU? "Slice" here means that GPU resources are always dedicated to that process. Using occupancy I would get to know the GPU and SM details, and based on that I would launch kernels specifying that blocks be created on those GPU resources.
I am using an NVIDIA Corporation GK210GL [Tesla K80] with the CUDA 9 toolkit installed.
Please suggest. Thanks!
There are ways through which GPU sharing can be achieved.
No, there aren't. In general, there is no such thing as the type of GPU sharing that you envisage. There is the MPS server for MPI-style multi-process computing, but that is irrelevant in the context of running TensorFlow (see here for why MPS can't be used).
I came across occupancy. Can I use it to slice the GPU among the processes (e.g. tensorflow) which are sharing GPU?
No, you can't. Occupancy is a performance metric. It has nothing whatsoever to do with the ability to share a GPU's resources amongst different processes.
Please suggest
Buy a second GPU.

Will adding GPU cards automatically scale tensorflow usage?

Suppose I can train with sample size N, batch size M and network depth L on my GTX 1070 card with TensorFlow. Now suppose I want to train with a larger sample 2N and/or a deeper network 2L and I get an out-of-memory error.
Will plugging in additional GPU cards automatically solve this problem (supposing that the total amount of memory of all the GPU cards is sufficient to hold the batch and its gradients)? Or is it impossible with pure TensorFlow?
I've read that there are Bitcoin or Ethereum miners that can build mining farms with multiple GPU cards, and that these farms will mine faster.
Will a mining farm also perform better for deep learning?
Will plugging additional GPU cards automatically solve this problem?
No. You have to change your TensorFlow code to explicitly compute different operations on different devices (e.g. compute the gradients over a single batch on every GPU, then send the computed gradients to a coordinator that accumulates them and updates the model parameters by averaging these gradients).
Also, TensorFlow is flexible enough to let you specify different operations for every different device (or for different remote nodes, it's the same).
You could do data augmentation on a single computational node and let the others process the data without applying this function. You can execute certain operations on one device, or on a set of devices, only.
it is impossible with pure tensorflow?
It's possible with TensorFlow, but you have to change the code you wrote for a single training/inference device.
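In current TensorFlow (2.x), this per-GPU gradient computation with a coordinator that averages the results is packaged as tf.distribute.MirroredStrategy. A minimal sketch, assuming a toy Keras model and random data (all layer sizes and names here are illustrative, not part of the original answer):

import numpy as np
import tensorflow as tf

# Replicates the model on every visible GPU and averages gradients.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# The global batch is split across the replicas automatically.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.rand(1024, 1).astype("float32")
model.fit(x, y, batch_size=256, epochs=2)

With N GPUs each replica processes 1/N of every batch, which is exactly the "compute gradients per GPU, then average" scheme described above.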
I've read that there are Bitcoin or Ethereum miners that can build mining farms with multiple GPU cards, and that these farms will mine faster.
Will a mining farm also perform better for deep learning?
Blockchains that use POW (Proof of Work) require solving a difficult problem using a brute-force-like approach (they compute lots of hashes with different inputs until they find a valid hash).
That means that if your single GPU can guess 1000 hashes/s, 2 identical GPUs can guess 2 x 1000 hashes/s.
The computations the GPUs are doing are completely uncorrelated: the data produced by GPU:0 is not used by GPU:1, and there are no synchronization points between the computations. This means that the task one GPU does can be executed in parallel by another GPU (obviously with different inputs per GPU, so the devices compute hashes to solve different problems given by the network).
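To make "completely uncorrelated" concrete, here is a tiny CPU-based analogy using only the standard library (the nonce ranges and "difficulty" are arbitrary): each worker searches its own range of inputs and never needs data from the others, which is why adding workers, or GPUs in the mining case, scales almost linearly.

import hashlib
from multiprocessing import Pool

def search(nonce_range):
    # Each worker hashes its own inputs; nothing is shared between workers.
    for nonce in nonce_range:
        digest = hashlib.sha256(f"block-data-{nonce}".encode()).hexdigest()
        if digest.startswith("0000"):      # toy "difficulty" target
            return (nonce, digest)
    return None

if __name__ == "__main__":
    ranges = [range(i * 100_000, (i + 1) * 100_000) for i in range(4)]
    with Pool(4) as pool:                  # 4 independent "miners"
        print([r for r in pool.map(search, ranges) if r])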
Back to TensorFlow: once you have modified your code to work with multiple GPUs, you can train your network faster (in short, because you're using bigger batches).

Gaming GPUs and TensorFlow

I went through the MNIST tutorial with conv nets and, during the training, for the first time felt the need to use a GPU. I have a GeForce GTX 830M on my laptop and was wondering if I could use it with TensorFlow?
Should I invest the time to try to get it working, or start searching for a low-cost GPU with the right requirements?
[I've been reading about very expensive and highly specialized equipment like the NVIDIA DIGITS, equipment with half precision, etc.]
Looking at this chart, the 830M has compute capability 5.0, so in theory you'll be able to run TensorFlow (which requires 3.5). In practice you'll often hit problems with low memory on laptop GPUs, so you'll likely want to graduate to a desktop to do serious work, but it's a good way to get started.
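To check this without digging through spec charts, recent TensorFlow releases can report the compute capability of a detected GPU directly; a small sketch, assuming the experimental get_device_details API (it may not fill in every field on every platform):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    print("No GPU visible to TensorFlow")
else:
    # May include 'device_name' and 'compute_capability' (a (major, minor) tuple).
    details = tf.config.experimental.get_device_details(gpus[0])
    print(details.get('device_name'), details.get('compute_capability'))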