Tensorflow not detecting GPU - Adding visible gpu devices: 0 - tensorflow
I have a system with an NVIDIA GeForce GTX 980 Ti. I installed TensorFlow and looked for the GPU device with tf.test.gpu_device_name(). It looks like it finds the GPU, but then it says "Adding visible gpu devices: 0".
>>> import tensorflow as tf
>>> tf.test.gpu_device_name()
2019-01-08 10:01:12.589000: I tensorflow/core/platform/cpu_feature_guard.cc:141]
Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2
2019-01-08 10:01:12.855000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1
432] Found device 0 with properties:
name: GeForce GTX 980 Ti major: 5 minor: 2 memoryClockRate(GHz): 1.228
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.67GiB
2019-01-08 10:01:12.862000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1
511] Adding visible gpu devices: 0
Interestingly, the 0 you are concerned about is not a count. The message does not mean "detected 0 devices"; it means "device 0 detected".
In "Adding visible gpu devices: 0", the 0 is the identifier of your GPU; it is how TensorFlow distinguishes between multiple GPUs in a system.
Here is the output from my system, and I'm quite sure I'm using my GPU for computation.
So don't worry, you are good to go! 😉
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.gpu_device_name()
2019-01-08 20:51:02.212125: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-01-08 20:51:03.199893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-01-08 20:51:03.207308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-01-08 20:51:04.857881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-08 20:51:04.861791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-01-08 20:51:04.863796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-01-08 20:51:04.867507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:0 with 4722 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
'/device:GPU:0'
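If you want extra confirmation that the device is actually usable (and not just detected), a minimal sketch, assuming a TensorFlow 1.x install like the one in the logs above, is to place a small op on the GPU and run it with device placement logging:
import tensorflow as tf

# Log which device each op is placed on (TF 1.x session API).
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    with tf.device('/device:GPU:0'):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
        c = tf.matmul(a, b)
    # If the GPU is usable, the log shows MatMul placed on /device:GPU:0.
    print(sess.run(c))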
Running the command prompt as administrator solved the issue in my case.
You can try one of the following commands:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
or
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
This will show you the GPU devices and how many of them there are.
The following setup resolved the issue for me:
tensorflow 2.4.1
CUDA 11.0.2
cuDNN 8.1.0
First install TensorFlow, then install CUDA (https://developer.nvidia.com/cuda-11.0-download-archive). After that, download the cuDNN zip file from https://developer.nvidia.com/rdp/cudnn-download, unzip it, and copy the cudnn64_8.dll file into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin.
Then everything works like a charm.
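To sanity-check that the versions match and the GPU is now visible, a small check you can run afterwards (a sketch; tf.sysconfig.get_build_info() is available in recent TF 2.x releases such as 2.4.1):
import tensorflow as tf

# Should list the GPU once CUDA and cuDNN are installed correctly.
print(tf.config.list_physical_devices('GPU'))

# The CUDA/cuDNN versions this TensorFlow build was compiled against.
info = tf.sysconfig.get_build_info()
print(info.get('cuda_version'), info.get('cudnn_version'))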
I was also facing the same problem, and creating a conda environment from an environment.yml file solved it for me. The content of the .yml file is as follows. Make sure to provide your own system path in the last line (the prefix); e.g. "/home/nikhilanand_1921cs24" should be replaced with your path.
name: keras-gpu
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=4.5=1_gnu
- _tflow_select=2.1.0=gpu
- absl-py=0.13.0=py39h06a4308_0
- aiohttp=3.8.1=py39h7f8727e_0
- aiosignal=1.2.0=pyhd3eb1b0_0
- astor=0.8.1=py39h06a4308_0
- astunparse=1.6.3=py_0
- async-timeout=4.0.1=pyhd3eb1b0_0
- attrs=21.2.0=pyhd3eb1b0_0
- blas=1.0=mkl
- blinker=1.4=py39h06a4308_0
- brotli=1.0.9=h7f98852_5
- brotli-bin=1.0.9=h7f98852_5
- brotlipy=0.7.0=py39h27cfd23_1003
- c-ares=1.17.1=h27cfd23_0
- ca-certificates=2021.10.8=ha878542_0
- cachetools=4.2.2=pyhd3eb1b0_0
- certifi=2021.10.8=py39hf3d152e_1
- cffi=1.14.6=py39h400218f_0
- charset-normalizer=2.0.4=pyhd3eb1b0_0
- click=8.0.3=pyhd3eb1b0_0
- cryptography=3.4.8=py39hd23ed53_0
- cudatoolkit=10.1.243=h6bb024c_0
- cudnn=7.6.5=cuda10.1_0
- cupti=10.1.168=0
- cycler=0.11.0=pyhd8ed1ab_0
- dataclasses=0.8=pyh6d0b6a4_7
- dbus=1.13.6=he372182_0
- expat=2.2.10=h9c3ff4c_0
- fontconfig=2.13.1=h6c09931_0
- fonttools=4.25.0=pyhd3eb1b0_0
- freetype=2.10.4=h0708190_1
- frozenlist=1.2.0=py39h7f8727e_0
- gast=0.4.0=pyhd3eb1b0_0
- glib=2.69.1=h4ff587b_1
- google-auth=1.33.0=pyhd3eb1b0_0
- google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
- google-pasta=0.2.0=pyhd3eb1b0_0
- grpcio=1.42.0=py39hce63b2e_0
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=h28cd5cc_2
- h5py=2.10.0=py39hec9cf62_0
- hdf5=1.10.6=hb1b8bf9_0
- icu=58.2=hf484d3e_1000
- idna=3.3=pyhd3eb1b0_0
- importlib-metadata=4.8.2=py39h06a4308_0
- intel-openmp=2021.4.0=h06a4308_3561
- jpeg=9d=h7f8727e_0
- keras-preprocessing=1.1.2=pyhd3eb1b0_0
- kiwisolver=1.3.1=py39h2531618_0
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.35.1=h7274673_9
- libbrotlicommon=1.0.9=h7f98852_5
- libbrotlidec=1.0.9=h7f98852_5
- libbrotlienc=1.0.9=h7f98852_5
- libffi=3.3=he6710b0_2
- libgcc-ng=9.3.0=h5101ec6_17
- libgfortran-ng=7.5.0=ha8ba4b0_17
- libgfortran4=7.5.0=ha8ba4b0_17
- libgomp=9.3.0=h5101ec6_17
- libpng=1.6.37=h21135ba_2
- libprotobuf=3.17.2=h4ff587b_1
- libstdcxx-ng=9.3.0=hd4cf53a_17
- libtiff=4.2.0=h85742a9_0
- libuuid=1.0.3=h7f8727e_2
- libwebp-base=1.2.0=h27cfd23_0
- libxcb=1.13=h7f98852_1003
- libxml2=2.9.12=h03d6c58_0
- lz4-c=1.9.3=h9c3ff4c_1
- markdown=3.3.4=py39h06a4308_0
- matplotlib=3.4.3=py39hf3d152e_2
- matplotlib-base=3.4.3=py39hbbc1b5f_0
- mkl=2021.4.0=h06a4308_640
- mkl-service=2.4.0=py39h7f8727e_0
- mkl_fft=1.3.1=py39hd3c417c_0
- mkl_random=1.2.2=py39h51133e4_0
- multidict=5.1.0=py39h27cfd23_2
- munkres=1.1.4=pyh9f0ad1d_0
- ncurses=6.3=h7f8727e_2
- numpy=1.21.2=py39h20f2e39_0
- numpy-base=1.21.2=py39h79a1101_0
- oauthlib=3.1.1=pyhd3eb1b0_0
- olefile=0.46=pyh9f0ad1d_1
- openssl=1.1.1m=h7f8727e_0
- opt_einsum=3.3.0=pyhd3eb1b0_1
- pcre=8.45=h9c3ff4c_0
- pip=21.2.4=py39h06a4308_0
- protobuf=3.17.2=py39h295c915_0
- pthread-stubs=0.4=h36c2ea0_1001
- pyasn1=0.4.8=pyhd3eb1b0_0
- pyasn1-modules=0.2.8=py_0
- pycparser=2.21=pyhd3eb1b0_0
- pyjwt=2.1.0=py39h06a4308_0
- pyopenssl=21.0.0=pyhd3eb1b0_1
- pyparsing=3.0.7=pyhd8ed1ab_0
- pyqt=5.9.2=py39h2531618_6
- pysocks=1.7.1=py39h06a4308_0
- python=3.9.7=h12debd9_1
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-flatbuffers=2.0=pyhd3eb1b0_0
- python_abi=3.9=2_cp39
- qt=5.9.7=h5867ecd_1
- readline=8.1=h27cfd23_0
- requests=2.26.0=pyhd3eb1b0_0
- requests-oauthlib=1.3.0=py_0
- rsa=4.7.2=pyhd3eb1b0_1
- scipy=1.7.1=py39h292c36d_2
- setuptools=58.0.4=py39h06a4308_0
- sip=4.19.13=py39h295c915_0
- six=1.16.0=pyhd3eb1b0_0
- sqlite=3.36.0=hc218d9a_0
- tensorboard-plugin-wit=1.6.0=py_0
- tensorflow-estimator=2.6.0=pyh7b7c402_0
- termcolor=1.1.0=py39h06a4308_1
- tk=8.6.11=h1ccaba5_0
- tornado=6.1=py39h3811e60_1
- typing-extensions=3.10.0.2=hd3eb1b0_0
- typing_extensions=3.10.0.2=pyh06a4308_0
- tzdata=2021e=hda174b7_0
- urllib3=1.26.7=pyhd3eb1b0_0
- werkzeug=2.0.2=pyhd3eb1b0_0
- wheel=0.37.0=pyhd3eb1b0_1
- wrapt=1.13.3=py39h7f8727e_2
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xz=5.2.5=h7b6447c_0
- yarl=1.6.3=py39h27cfd23_0
- zipp=3.6.0=pyhd3eb1b0_0
- zlib=1.2.11=h7f8727e_4
- zstd=1.4.9=ha95c52a_0
- pip:
  - joblib==1.1.0
  - keras==2.8.0
  - keras-applications==1.0.8
  - libclang==13.0.0
  - opencv-python==4.5.5.62
  - pandas==1.4.0
  - pillow==9.0.1
  - pytz==2021.3
  - pyyaml==6.0
  - scikit-learn==1.0.2
  - tensorboard==2.8.0
  - tensorboard-data-server==0.6.1
  - tensorflow==2.8.0
  - tensorflow-gpu==2.8.0
  - tensorflow-io-gcs-filesystem==0.23.1
  - tf-estimator-nightly==2.8.0.dev2021122109
  - threadpoolctl==3.0.0
prefix: /home/nikhilanand_1921cs24/anaconda3/envs/keras-gpu
Create the environment by running conda env create -f environment.yml
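Once the environment exists, activate it with conda activate keras-gpu and verify the GPU build from inside it; a minimal check, assuming the environment name from the yml above, is:
import tensorflow as tf

# True if this TensorFlow build was compiled with CUDA support.
print(tf.test.is_built_with_cuda())

# Should list at least one physical GPU inside the keras-gpu environment.
print(tf.config.list_physical_devices('GPU'))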
Related
Order of CUDA devices [duplicate]
This question already has answers here: How does CUDA assign device IDs to GPUs? (4 answers) Closed 4 years ago. I saw this solution, but it doesn't quite answer my question; it's also quite old so I'm not sure how relevant it is. I keep getting conflicting outputs for the order of GPU units. There are two of them: Tesla K40 and NVS315 (legacy device that is never used). When I run deviceQuery, I get Device 0: "Tesla K40m" ... Device PCI Domain ID / Bus ID / location ID: 0 / 4 / 0 Device 1: "NVS 315" ... Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 On the other hand, nvidia-smi produces a different order: 0 NVS 315 1 Tesla K40m Which I find very confusing. The solution I found for Tensorflow (and a similar one for Pytorch) is to use import os os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" os.environ["CUDA_VISIBLE_DEVICES"]="0" PCI Bus ID is 4 for Tesla and 3 for NVS, so it should set it to 3 (NVS), is that right? In pytorch I set os.environ['CUDA_VISIBLE_DEVICES']='0' ... device = torch.cuda.device(0) print torch.cuda.get_device_name(0) to get Tesla K40m when I set instead os.environ['CUDA_VISIBLE_DEVICES']='1' device = torch.cuda.device(1) print torch.cuda.get_device_name(0) to get UserWarning: Found GPU0 NVS 315 which is of cuda capability 2.1. PyTorch no longer supports this GPU because it is too old. warnings.warn(old_gpu_warn % (d, name, major, capability[1])) NVS 315 So I'm quite confused: what's the true order of GPU devices that tf and pytorch use?
By default, CUDA orders the GPUs by computing power. GPU:0 will be the fastest GPU on your host, in your case the K40m. If you set CUDA_DEVICE_ORDER='PCI_BUS_ID' then CUDA orders your GPU depending on how you set up your machine meaning that GPU:0 will be the GPU on your first PCI-E lane. Both Tensorflow and PyTorch use the CUDA GPU order. That is consistent with what you showed: os.environ['CUDA_VISIBLE_DEVICES']='0' ... device = torch.cuda.device(0) print torch.cuda.get_device_name(0) Default order so GPU:0 is the K40m since it is the most powerful card on your host. os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" os.environ['CUDA_VISIBLE_DEVICES']='0' ... device = torch.cuda.device(0) print torch.cuda.get_device_name(0) PCI-E lane order, so GPU:0 is the card with the lowest bus-id in your case the NVS.
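A minimal sketch of forcing PCI-bus ordering (shown with PyTorch since the question uses it; the environment variables must be set before the CUDA runtime is initialised, i.e. before the first CUDA call):
import os

# Must be set before torch (or tensorflow) initialises CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # order devices by PCI bus id instead of fastest-first
os.environ["CUDA_VISIBLE_DEVICES"] = "0"         # expose only device 0 in that ordering

import torch

# Under PCI_BUS_ID ordering, device 0 is the card on the lowest bus id
# (the NVS 315 in this question), not necessarily the fastest card.
print(torch.cuda.get_device_name(0))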
Google compute engine cannot select 1 NVIDIA Tesla K80
I am trying to create preemptible VM on Google compute engine. For some reason I cannot select 1 GPU NVIDIA Tesla K80, it is simply grayed out. I can select 1 GPU NVIDIA Tesla P100. I can select 2 GPU NVIDIA Tesla K80, but then I get error: "Quota 'PREEMPTIBLE_NVIDIA_K80_GPUS' exceeded. Limit: 1.0 in region us-central1." I don't want to increase quota to 2 GPU, since I will have to deposit more money. Previously, I was able to select 2 GPU NVIDIA Tesla K80 and launch instance successfully, but something changed in last 2 months or so and now it is not working
Tensorflow CUDA fails with error "failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED"
Here is some of my console output. I am unsure what is the actually problem. When this is displayed I get a windows prompt stating Python.exe has stop working with the cause being ucrtbase.dll, but I've tried updating that and it still happens so I think that is the result of the real problem. Also I am notified by a taskbar message that my Nvidia Kernal Driver crashed, but recovered. 2017-11-04 17:48:17.363024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 2017-11-04 17:48:17.375024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations. 2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955 Found device 0 with properties: name: Quadro K1100M major: 3 minor: 0 memoryClockRate (GHz) 0.7055 pciBusID 0000:01:00.0 Total memory: 2.00GiB Free memory: 1.93GiB 2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0 2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y 2017-11-04 17:48:20.018175: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0) 2017-11-04 17:49:35.796510: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available. 2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_UNKNOWN 2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:54] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN 2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:59] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN 2017-11-04 17:49:41.811854: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
If you're still looking for the answer, try reducing the batch size. I'm not entirely sure what is happening with the error (there's no explanation on GitHub either), but reducing the batch size helped me.
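For context, the batch size is usually just the batch_size argument of your training call; a toy, hedged illustration with Keras (the model and random data here are placeholders, not from the question):
import numpy as np
import tensorflow as tf

# Hypothetical toy data standing in for your own dataset.
x_train = np.random.rand(1024, 32).astype(np.float32)
y_train = np.random.randint(0, 2, size=(1024,)).astype(np.float32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy')

# A smaller batch_size lowers per-step GPU memory use, which is often enough
# to avoid CUDNN_STATUS_EXECUTION_FAILED errors caused by exhausted GPU memory.
model.fit(x_train, y_train, batch_size=32, epochs=1)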
ram not detected by tensorflow gpu
I recently add a 8gb ram to my computer to facilitate computing, however the gpu tensorflow doesn't seem to recognize it even though my ubuntu recognize it. Here's my result after running sudo lshw -class memory and the result is *-memory description: System Memory physical id: 2c slot: System board or motherboard size: 16GiB *-bank:0 description: SODIMM Synchronous 2400 MHz (0.4 ns) product: HMA81GS6AFR8N-UH vendor: 009C35230000 physical id: 0 serial: 31D92036 slot: ChannelA-DIMM0 size: 8GiB width: 64 bits clock: 2400MHz (0.4ns) *-bank:1 description: SODIMM Synchronous 2400 MHz (0.4 ns) product: CT8G4SFS824A.C8FAD1 vendor: 009D36160000 physical id: 1 serial: 156C0B48 slot: ChannelB-DIMM0 size: 8GiB width: 64 bits clock: 2400MHz (0.4ns) However, the gpu tensorflow doesn't recognize it and still output as following I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate (GHz) 1.645 pciBusID 0000:01:00.0 Total memory: 7.92GiB Free memory: 7.66GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0) Is there any extra steps I need to do to get my full ram recognized?
GPU RAM and system RAM are two different and separate things. The 7.92GiB that TensorFlow reports is the memory on the GTX 1070 card itself, so adding system RAM will not change that number.
Tensorflow not using GPU for one dataset, where it does for a very similar dataset
I'm using TensorFlow to train a model using data originating from two sources. For both sources the training and validation data shape are almost identical and the dtypes throughout are np.float32. The strange thing is, when I use the first data set the GPU on my machine is used, but when using the second data set the GPU is not used. Does anyone have some suggestions on how to investigate? print(s1_train_data.shape) print(s1_train_data.values) (1165032, 941) [[ 0.45031181 -0.99680316 0.63686389 ..., 0.22323072 -0.37929842 0. ] [-0.40660214 0.34022757 -0.00710014 ..., -1.43051076 -0.14785887 1. ] [ 0.03955967 -0.91227823 0.37887612 ..., 0.16451506 -1.02560401 0. ] ..., [ 0.11746094 -0.18229018 0.43319091 ..., 0.36532226 -0.48208624 0. ] [ 0.110379 -1.07364404 0.42837444 ..., 0.74732345 0.92880726 0. ] [-0.81027234 -1.04290771 -0.56407243 ..., 0.25084609 -0.1797282 1. ]] print(s2_train_data.shape) print(s2_train_data.values) (559873, 941) [[ 0. 0. 0. ..., -1.02008295 0.27371082 0. ] [ 0. 0. 0. ..., -0.74775815 0.18743835 0. ] [ 0. 0. 0. ..., 0.6469788 0.67864949 1. ] ..., [ 0. 0. 0. ..., -0.88198501 -0.02421325 1. ] [ 0. 0. 0. ..., 0.28361112 -1.08478808 1. ] [ 0. 0. 0. ..., 0.22360609 0.50698668 0. ]] Edit. Here's a snip of the log with log_device_placement=True. I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GRID K520 major: 3 minor: 0 memoryClockRate (GHz) 0.797 pciBusID 0000:00:03.0 Total memory: 4.00GiB Free memory: 3.95GiB W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7578380 I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties: name: GRID K520 major: 3 minor: 0 memoryClockRate (GHz) 0.797 pciBusID 0000:00:04.0 Total memory: 4.00GiB Free memory: 3.95GiB W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7c54b10 I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties: name: GRID K520 major: 3 minor: 0 memoryClockRate (GHz) 0.797 pciBusID 0000:00:05.0 Total memory: 4.00GiB Free memory: 3.95GiB W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x65bb1d0 I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties: name: GRID K520 major: 3 minor: 0 memoryClockRate (GHz) 0.797 pciBusID 0000:00:06.0 Total memory: 4.00GiB Free memory: 3.95GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 2 I 
tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 3 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 2 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 3 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 1 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 3 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 1 I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 2 I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N N N I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y N N I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: N N Y N I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: N N N Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0) I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GRID K520, pci bus id: 0000:00:04.0) I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GRID K520, pci bus id: 0000:00:05.0) I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GRID K520, pci bus id: 0000:00:06.0) Device mapping: /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0 /job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0 /job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0 /job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0 I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping: /job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0 /job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0 /job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0 /job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0 WARNING:tensorflow:From tf.py:183 in get_session.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Use `tf.global_variables_initializer` instead. 
gradients_3/add_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0 gradients_3/add_2_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_2_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0 gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range)/job:localhost/replica:0/task:0/gpu:0 gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0 gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const): /job:localhost/replica:0/task:0/gpu:0 I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const)/job:localhost/replica:0/task:0/gpu:0 gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0 It does seem to be placing the tasks on the GPU, however I still see almost entirely 0% GPU-Util in the nvidia-smi monitor. The pandas dataframe is of course in memory. Is there any other IO that could be impacting this process? Edit 2: I captured the log_device_placement logs for both the fast and slow data sets. They are identical, even though in one case the GPU usage is 25%, and the other 0%. Really scratching my head now....
The cause of the slowness was the memory layout of the ndarray backing the DataFrame. The s2 data was column-major, meaning that each row of features and target was not contiguous in memory. This operation changes the memory layout: s2_train_data = s2_train_data.values.copy(order='C') and now the GPU is running at 26% utilisation. Happy days :)
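A quick way to check whether your own array has this problem (a small sketch with a synthetic column-major array, not the question's data):
import numpy as np

# A column-major (Fortran-ordered) array, like the one backing the s2 DataFrame.
arr = np.asfortranarray(np.random.rand(4, 3))
print(arr.flags['C_CONTIGUOUS'])    # False: the values of one row are scattered in memory

# Make a row-major copy so each sample's features are contiguous;
# this is equivalent to the .values.copy(order='C') call above.
arr_c = np.ascontiguousarray(arr)
print(arr_c.flags['C_CONTIGUOUS'])  # True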