I am trying to create preemptible VM on Google compute engine. For some reason I cannot select 1 GPU NVIDIA Tesla K80, it is simply grayed out. I can select 1 GPU NVIDIA Tesla P100.
I can select 2 GPU NVIDIA Tesla K80, but then I get error: "Quota 'PREEMPTIBLE_NVIDIA_K80_GPUS' exceeded. Limit: 1.0 in region us-central1."
I don't want to increase quota to 2 GPU, since I will have to deposit more money.
Previously, I was able to select 2 GPU NVIDIA Tesla K80 and launch instance successfully, but something changed in last 2 months or so and now it is not working
Related
This question already has answers here:
How does CUDA assign device IDs to GPUs?
(4 answers)
Closed 4 years ago.
I saw this solution, but it doesn't quite answer my question; it's also quite old so I'm not sure how relevant it is.
I keep getting conflicting outputs for the order of GPU units. There are two of them: Tesla K40 and NVS315 (legacy device that is never used). When I run deviceQuery, I get
Device 0: "Tesla K40m"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0
Device 1: "NVS 315"
...
Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0
On the other hand, nvidia-smi produces a different order:
0 NVS 315
1 Tesla K40m
Which I find very confusing. The solution I found for Tensorflow (and a similar one for Pytorch) is to use
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0"
PCI Bus ID is 4 for Tesla and 3 for NVS, so it should set it to 3 (NVS), is that right?
In pytorch I set
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
to get Tesla K40m
when I set instead
os.environ['CUDA_VISIBLE_DEVICES']='1'
device = torch.cuda.device(1)
print torch.cuda.get_device_name(0)
to get
UserWarning:
Found GPU0 NVS 315 which is of cuda capability 2.1.
PyTorch no longer supports this GPU because it is too old.
warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
NVS 315
So I'm quite confused: what's the true order of GPU devices that tf and pytorch use?
By default, CUDA orders the GPUs by computing power. GPU:0 will be the fastest GPU on your host, in your case the K40m.
If you set CUDA_DEVICE_ORDER='PCI_BUS_ID' then CUDA orders your GPU depending on how you set up your machine meaning that GPU:0 will be the GPU on your first PCI-E lane.
Both Tensorflow and PyTorch use the CUDA GPU order. That is consistent with what you showed:
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
Default order so GPU:0 is the K40m since it is the most powerful card on your host.
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES']='0'
...
device = torch.cuda.device(0)
print torch.cuda.get_device_name(0)
PCI-E lane order, so GPU:0 is the card with the lowest bus-id in your case the NVS.
I am going to buy a laptop to do some TF work. Is the GPU version of TF able to take advantage of Nvidia Quadro P1000 and P2000? Will it run faster on these two GPUs than on the mobile version of 1050Ti? Thanks
If I am correct, Tensorflow can run in all Nvidia devices that supports CUDA.
Check this website for their computational compabilities:
https://developer.nvidia.com/cuda-gpus
There you can see the computational power of Nvidia GPU cards.
For your questions about those three cards (P1000, P2000, GeForce 1050Ti), they all have the same computational capabilities: 6.1, which means they won't differ too much in GPU computation.
But from their datasheet (P2000, P1000, 1050ti):
---------------------------------------------------------
| | Memory | Memory Interface | Memory Bandwidth|
---------------------------------------------------------
|P1000 |4G GDRR5| 128 bit | 82Gb/s |
|P2000 |5G GDDR5| 160 bit | 140Gb/s |
|1050Ti |4G GDDR5| 128 bit | 112Gb/s |
---------------------------------------------------------
I would say, P2000 > 1050Ti > P1000
BTW, what does that 6.1 number mean? Basically, it means how much operations and functions they can support. You can find the details in the figure below and this link, and similar discussion here
Here is some of my console output. I am unsure what is the actually problem. When this is displayed I get a windows prompt stating Python.exe has stop working with the cause being ucrtbase.dll, but I've tried updating that and it still happens so I think that is the result of the real problem. Also I am notified by a taskbar message that my Nvidia Kernal Driver crashed, but recovered.
2017-11-04 17:48:17.363024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:17.375024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955
Found device 0 with properties:
name: Quadro K1100M
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.93GiB
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-11-04 17:48:20.018175: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0)
2017-11-04 17:49:35.796510: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:54] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:59] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
If you're still looking for the answer, try reducing the batch size. I'm not entirely sure what is happening with the error (there's no explanation on github either), but reducing the batch size helped me
I am running following code on Google Cloud ML using BASIC GPU (Tesla K80)
https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10
LRN is taking the most amount of time and its running on CPU. I am wondering if following stats quoted in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_train.py were obtained by running on CPU because I don't see thats the case.
System | Step Time (sec/batch) | Accuracy
1 Tesla K20m | 0.35-0.60 | ~86% at 60K steps (5 hours)
If I force it to run it with GPU it throws following error:
Cannot assign a device to node 'norm1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available. [[Node: norm1 = LRNT=DT_HALF, alpha=0.00011111111, beta=0.75, bias=1, depth_radius=4, _device="/device:GPU:0"]
I recently add a 8gb ram to my computer to facilitate computing, however the gpu tensorflow doesn't seem to recognize it even though my ubuntu recognize it. Here's my result after running sudo lshw -class memory
and the result is
*-memory
description: System Memory
physical id: 2c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: SODIMM Synchronous 2400 MHz (0.4 ns)
product: HMA81GS6AFR8N-UH
vendor: 009C35230000
physical id: 0
serial: 31D92036
slot: ChannelA-DIMM0
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:1
description: SODIMM Synchronous 2400 MHz (0.4 ns)
product: CT8G4SFS824A.C8FAD1
vendor: 009D36160000
physical id: 1
serial: 156C0B48
slot: ChannelB-DIMM0
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
However, the gpu tensorflow doesn't recognize it and still output as following
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.66GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Is there any extra steps I need to do to get my full ram recognized?
GPU RAM and system RAM are two different and separate things