RAM not detected by TensorFlow GPU - tensorflow

I recently added an 8 GB RAM module to my computer to speed up computation, but GPU TensorFlow doesn't seem to recognize it, even though Ubuntu does. Here is the output of running sudo lshw -class memory:
*-memory
description: System Memory
physical id: 2c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: SODIMM Synchronous 2400 MHz (0.4 ns)
product: HMA81GS6AFR8N-UH
vendor: 009C35230000
physical id: 0
serial: 31D92036
slot: ChannelA-DIMM0
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:1
description: SODIMM Synchronous 2400 MHz (0.4 ns)
product: CT8G4SFS824A.C8FAD1
vendor: 009D36160000
physical id: 1
serial: 156C0B48
slot: ChannelB-DIMM0
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
However, GPU TensorFlow doesn't recognize it and still outputs the following:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.66GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating
TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
Are there any extra steps I need to take to get my full RAM recognized?

GPU RAM and system RAM are two different and separate things. The 7.92 GiB that TensorFlow reports is the dedicated memory on the GTX 1070 itself, so adding system RAM will not change that number.
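For a quick way to see the two numbers side by side, something like the sketch below works (device_lib ships with TensorFlow; psutil is an extra dependency used here only to read system RAM, so treat it as illustrative):
from tensorflow.python.client import device_lib
import psutil

# GPU memory: reported per device by TensorFlow (memory_limit is in bytes)
for d in device_lib.list_local_devices():
    if d.device_type == 'GPU':
        print(d.name, round(d.memory_limit / 2**30, 2), "GiB of GPU memory")

# System RAM: what lshw showed above; TensorFlow only uses it for host-side work
print("System RAM:", round(psutil.virtual_memory().total / 2**30, 2), "GiB")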

Related

Vega 56 Bandwidth Slow in (not only) TensorFlow - Why and how?

I just built a new AMD-based PC with a Ryzen 7 3700X CPU, a Radeon RX Vega 56 GPU, and Ubuntu 18.04. In order to use the AMD GPU for TensorFlow, I followed these two guides to install ROCm. Everything seemed fine and there were no problems during installation. I think I installed ROCm 3. I followed the posts exactly.
https://towardsdatascience.com/train-neural-networks-using-amd-gpus-and-keras-37189c453878
https://www.videogames.ai/Install-ROCM-Machine-Learning-AMD-GPU
video link: https://www.youtube.com/watch?v=fkSRkAoMS4g
But when I ran rocm-bandwidth-test in the terminal, as in the video, I got the result below.
(base) nick@nick-nbpc:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 3700X 8-Core Processor
Device: 1, Vega 10 XT [Radeon RX Vega 64], 2f:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.295924
1 8.892247 72.654038
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.103560
1 17.103560 N/A
(base) nick@nick-nbpc:~$
The video uses an AMD RX 580 GPU, and I compared the technical specs from the link below:
https://www.youtube.com/watch?v=shstdFZJJ_o
which shows that the RX 580 has a memory bandwidth of 256 GB/s and the Vega 56 has 409.6 GB/s. In the other video, the uploader gets a bandwidth of 195 GB/s at 11:09. But my Vega 56 only reaches 72.5 GB/s! This is a huge difference. I don't know what is wrong.
Then I installed Python 3.6 and TensorFlow-ROCm, and I cloned https://github.com/tensorflow/benchmarks.git, just as in the video, to run the benchmark test in TensorFlow.
Execute the code:
python tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
Gives this result:
Done warm up
Step Img/sec total_loss
1 images/sec: 81.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 80.7 +/- 0.1 (jitter = 0.2) 8.049
20 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.808
30 images/sec: 80.7 +/- 0.0 (jitter = 0.1) 7.976
40 images/sec: 80.9 +/- 0.1 (jitter = 0.2) 7.591
50 images/sec: 81.2 +/- 0.1 (jitter = 0.3) 7.549
60 images/sec: 81.5 +/- 0.1 (jitter = 0.6) 7.819
70 images/sec: 81.7 +/- 0.1 (jitter = 1.1) 7.820
80 images/sec: 81.8 +/- 0.1 (jitter = 1.5) 7.847
90 images/sec: 82.0 +/- 0.1 (jitter = 0.8) 8.025
100 images/sec: 82.1 +/- 0.1 (jitter = 0.6) 8.029
----------------------------------------------------------------
total images/sec: 82.07
----------------------------------------------------------------
The result is not as good as I expected; I was expecting a number above 100. But given my limited knowledge of Ubuntu/AMD/TensorFlow, I may very well be wrong. If not, can someone tell me why my bandwidth is nowhere near 400 GB/s?
========================================
clinfo
(base) nick@nick-nbpc:~$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3137.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XT [Radeon RX Vega 64]
Device Topology: PCI[ B#47, D#0, F#0 ]
Max compute units: 56
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1590Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7fe56aa5fcf0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3137.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
(base) nick@nick-nbpc:~$
rocminfo
(base) nick@nick-nbpc:~$ rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 3700X 8-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 3700X 8-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 0
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 16436616(0xfacd88) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-02151e1bb9ee2144
Marketing Name: Vega 10 XT [Radeon RX Vega 64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1590
BDFID: 12032
Internal Node ID: 1
Compute Unit: 56
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
(base) nick@nick-nbpc:~$
I can't answer the bandwidth question, but I have just tried out the same benchmarks (following the YouTube video).
I get:
(vrocm1) user1@t1000test:~$ rocm-bandwidth-test
........
RocmBandwidthTest Version: 2.3.11
Launch Command is: rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 7 2700X Eight-Core Processor
Device: 1, Vega 10 XL/XT [Radeon RX Vega 56/64], 28:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 9.542044
1 9.028717 72.202459
Bdirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 17.144430
1 17.144430 N/A
which is the same as what you got. But:
python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50
gives me:
Done warm up
Step Img/sec total_loss
1 images/sec: 172.0 +/- 0.0 (jitter = 0.0) 7.765
10 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 8.049
20 images/sec: 172.6 +/- 0.1 (jitter = 0.4) 7.808
30 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.976
40 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.591
50 images/sec: 172.5 +/- 0.1 (jitter = 0.6) 7.549
60 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
70 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.819
80 images/sec: 172.6 +/- 0.1 (jitter = 0.5) 7.848
90 images/sec: 172.6 +/- 0.0 (jitter = 0.5) 8.025
100 images/sec: 172.5 +/- 0.0 (jitter = 0.5) 8.029
----------------------------------------------------------------
total images/sec: 172.39
----------------------------------------------------------------
clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3182.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Device Topology: PCI[ B#40, D#0, F#0 ]
Max compute units: 64
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 1630Mhz
Address bits: 64
Max memory allocation: 7287183769
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 26751
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 8573157376
Constant buffer size: 7287183769
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 2992216473
Max global variable size: 7287183769
Max global variable preferred total size: 8573157376
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7efe04b66cd0
Name: gfx900
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3182.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program
rocminfo
ROCk module is loaded
Able to open /dev/kfd read-write
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 7 2700X Eight-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 7 2700X Eight-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3700
BDFID: 0
Internal Node ID: 0
Compute Unit: 16
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32898020(0x1f5fbe4) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Uuid: GPU-021508a5025618e4
Marketing Name: Vega 10 XL/XT [Radeon RX Vega 56/64]
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 26751(0x687f)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1630
BDFID: 10240
Internal Node ID: 1
Compute Unit: 64
SIMDs per CU: 4
Shader Engines: 4
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224(0x7fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx900
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
The only thing that you seem to have done differently is:
Execute the code: python tf_cnn_benchmarks.py --num_gpus=1
--batch_size=32 --model=resnet50
which runs the test under python2 rather than python3 ... (but maybe it's just a typo).
Greetings, gspeet

Tensorflow not detecting GPU - Adding visible gpu devices: 0

I have a system with an NVIDIA GeForce GTX 980 Ti. I installed TensorFlow and looked for the GPU device with tf.test.gpu_device_name(). It looks like it finds the GPU, but then says "Adding visible gpu devices: 0".
>>> import tensorflow as tf
>>> tf.test.gpu_device_name()
2019-01-08 10:01:12.589000: I tensorflow/core/platform/cpu_feature_guard.cc:141]
Your CPU supports instructions that this TensorFlow binary was not compiled to
use: AVX2
2019-01-08 10:01:12.855000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1
432] Found device 0 with properties:
name: GeForce GTX 980 Ti major: 5 minor: 2 memoryClockRate(GHz): 1.228
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.67GiB
2019-01-08 10:01:12.862000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1
511] Adding visible gpu devices: 0
Interestingly, the 0 you are concerned about is not the 0 you would use for counting. Precisely, it's not "detected 0 devices" but "device 0 detected".
In "Adding visible gpu devices: 0", the 0 is an identifier for your GPU, i.e. the way TensorFlow differentiates between multiple GPUs in the system.
Here is the output from my system, and I'm pretty sure I'm using my GPU for computation.
So don't worry, you are good to go! 😉
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 27 2018, 03:37:03) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> tf.test.gpu_device_name()
2019-01-08 20:51:02.212125: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-01-08 20:51:03.199893: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties:
name: GeForce GTX 1060 with Max-Q Design major: 6 minor: 1 memoryClockRate(GHz): 1.3415
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2019-01-08 20:51:03.207308: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2019-01-08 20:51:04.857881: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-08 20:51:04.861791: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0
2019-01-08 20:51:04.863796: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N
2019-01-08 20:51:04.867507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/device:GPU:0 with 4722 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
'/device:GPU:0'
Running the prompt as administrator solved the issue in my case.
You can try one of the following commands (after import tensorflow as tf and from tensorflow.python.client import device_lib):
device_lib.list_local_devices()
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
This will show you the GPU devices and their number.
My setup to overcome the issue is as follows:
tensorflow 2.4.1
cuda 11.0.2
cuDNN 8.1.0
So first you install TensorFlow. Then you proceed with CUDA (https://developer.nvidia.com/cuda-11.0-download-archive), and after that you download the cuDNN zip file from https://developer.nvidia.com/rdp/cudnn-download, unzip it, and paste the cudnn64_8.dll file into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin.
Then everything works like a charm.
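For what it's worth, a minimal sanity check after copying the DLL (assuming the TensorFlow 2.4.1 setup above) is:
import tensorflow as tf

print(tf.__version__)                          # expect 2.4.1
print(tf.test.is_built_with_cuda())            # True for a CUDA-enabled build
print(tf.config.list_physical_devices('GPU'))  # should list your GPU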
I was also facing the same problem, and creating a conda environment with an environment.yml file solved the issue for me. The contents of the .yml file are as follows. Please make sure to provide your own system path in the last line of the file; e.g. "/home/nikhilanand_1921cs24" should be replaced with your system path.
name: keras-gpu
channels:
- conda-forge
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=4.5=1_gnu
- _tflow_select=2.1.0=gpu
- absl-py=0.13.0=py39h06a4308_0
- aiohttp=3.8.1=py39h7f8727e_0
- aiosignal=1.2.0=pyhd3eb1b0_0
- astor=0.8.1=py39h06a4308_0
- astunparse=1.6.3=py_0
- async-timeout=4.0.1=pyhd3eb1b0_0
- attrs=21.2.0=pyhd3eb1b0_0
- blas=1.0=mkl
- blinker=1.4=py39h06a4308_0
- brotli=1.0.9=h7f98852_5
- brotli-bin=1.0.9=h7f98852_5
- brotlipy=0.7.0=py39h27cfd23_1003
- c-ares=1.17.1=h27cfd23_0
- ca-certificates=2021.10.8=ha878542_0
- cachetools=4.2.2=pyhd3eb1b0_0
- certifi=2021.10.8=py39hf3d152e_1
- cffi=1.14.6=py39h400218f_0
- charset-normalizer=2.0.4=pyhd3eb1b0_0
- click=8.0.3=pyhd3eb1b0_0
- cryptography=3.4.8=py39hd23ed53_0
- cudatoolkit=10.1.243=h6bb024c_0
- cudnn=7.6.5=cuda10.1_0
- cupti=10.1.168=0
- cycler=0.11.0=pyhd8ed1ab_0
- dataclasses=0.8=pyh6d0b6a4_7
- dbus=1.13.6=he372182_0
- expat=2.2.10=h9c3ff4c_0
- fontconfig=2.13.1=h6c09931_0
- fonttools=4.25.0=pyhd3eb1b0_0
- freetype=2.10.4=h0708190_1
- frozenlist=1.2.0=py39h7f8727e_0
- gast=0.4.0=pyhd3eb1b0_0
- glib=2.69.1=h4ff587b_1
- google-auth=1.33.0=pyhd3eb1b0_0
- google-auth-oauthlib=0.4.4=pyhd3eb1b0_0
- google-pasta=0.2.0=pyhd3eb1b0_0
- grpcio=1.42.0=py39hce63b2e_0
- gst-plugins-base=1.14.0=hbbd80ab_1
- gstreamer=1.14.0=h28cd5cc_2
- h5py=2.10.0=py39hec9cf62_0
- hdf5=1.10.6=hb1b8bf9_0
- icu=58.2=hf484d3e_1000
- idna=3.3=pyhd3eb1b0_0
- importlib-metadata=4.8.2=py39h06a4308_0
- intel-openmp=2021.4.0=h06a4308_3561
- jpeg=9d=h7f8727e_0
- keras-preprocessing=1.1.2=pyhd3eb1b0_0
- kiwisolver=1.3.1=py39h2531618_0
- lcms2=2.12=hddcbb42_0
- ld_impl_linux-64=2.35.1=h7274673_9
- libbrotlicommon=1.0.9=h7f98852_5
- libbrotlidec=1.0.9=h7f98852_5
- libbrotlienc=1.0.9=h7f98852_5
- libffi=3.3=he6710b0_2
- libgcc-ng=9.3.0=h5101ec6_17
- libgfortran-ng=7.5.0=ha8ba4b0_17
- libgfortran4=7.5.0=ha8ba4b0_17
- libgomp=9.3.0=h5101ec6_17
- libpng=1.6.37=h21135ba_2
- libprotobuf=3.17.2=h4ff587b_1
- libstdcxx-ng=9.3.0=hd4cf53a_17
- libtiff=4.2.0=h85742a9_0
- libuuid=1.0.3=h7f8727e_2
- libwebp-base=1.2.0=h27cfd23_0
- libxcb=1.13=h7f98852_1003
- libxml2=2.9.12=h03d6c58_0
- lz4-c=1.9.3=h9c3ff4c_1
- markdown=3.3.4=py39h06a4308_0
- matplotlib=3.4.3=py39hf3d152e_2
- matplotlib-base=3.4.3=py39hbbc1b5f_0
- mkl=2021.4.0=h06a4308_640
- mkl-service=2.4.0=py39h7f8727e_0
- mkl_fft=1.3.1=py39hd3c417c_0
- mkl_random=1.2.2=py39h51133e4_0
- multidict=5.1.0=py39h27cfd23_2
- munkres=1.1.4=pyh9f0ad1d_0
- ncurses=6.3=h7f8727e_2
- numpy=1.21.2=py39h20f2e39_0
- numpy-base=1.21.2=py39h79a1101_0
- oauthlib=3.1.1=pyhd3eb1b0_0
- olefile=0.46=pyh9f0ad1d_1
- openssl=1.1.1m=h7f8727e_0
- opt_einsum=3.3.0=pyhd3eb1b0_1
- pcre=8.45=h9c3ff4c_0
- pip=21.2.4=py39h06a4308_0
- protobuf=3.17.2=py39h295c915_0
- pthread-stubs=0.4=h36c2ea0_1001
- pyasn1=0.4.8=pyhd3eb1b0_0
- pyasn1-modules=0.2.8=py_0
- pycparser=2.21=pyhd3eb1b0_0
- pyjwt=2.1.0=py39h06a4308_0
- pyopenssl=21.0.0=pyhd3eb1b0_1
- pyparsing=3.0.7=pyhd8ed1ab_0
- pyqt=5.9.2=py39h2531618_6
- pysocks=1.7.1=py39h06a4308_0
- python=3.9.7=h12debd9_1
- python-dateutil=2.8.2=pyhd8ed1ab_0
- python-flatbuffers=2.0=pyhd3eb1b0_0
- python_abi=3.9=2_cp39
- qt=5.9.7=h5867ecd_1
- readline=8.1=h27cfd23_0
- requests=2.26.0=pyhd3eb1b0_0
- requests-oauthlib=1.3.0=py_0
- rsa=4.7.2=pyhd3eb1b0_1
- scipy=1.7.1=py39h292c36d_2
- setuptools=58.0.4=py39h06a4308_0
- sip=4.19.13=py39h295c915_0
- six=1.16.0=pyhd3eb1b0_0
- sqlite=3.36.0=hc218d9a_0
- tensorboard-plugin-wit=1.6.0=py_0
- tensorflow-estimator=2.6.0=pyh7b7c402_0
- termcolor=1.1.0=py39h06a4308_1
- tk=8.6.11=h1ccaba5_0
- tornado=6.1=py39h3811e60_1
- typing-extensions=3.10.0.2=hd3eb1b0_0
- typing_extensions=3.10.0.2=pyh06a4308_0
- tzdata=2021e=hda174b7_0
- urllib3=1.26.7=pyhd3eb1b0_0
- werkzeug=2.0.2=pyhd3eb1b0_0
- wheel=0.37.0=pyhd3eb1b0_1
- wrapt=1.13.3=py39h7f8727e_2
- xorg-libxau=1.0.9=h7f98852_0
- xorg-libxdmcp=1.1.3=h7f98852_0
- xz=5.2.5=h7b6447c_0
- yarl=1.6.3=py39h27cfd23_0
- zipp=3.6.0=pyhd3eb1b0_0
- zlib=1.2.11=h7f8727e_4
- zstd=1.4.9=ha95c52a_0
- pip:
- joblib==1.1.0
- keras==2.8.0
- keras-applications==1.0.8
- libclang==13.0.0
- opencv-python==4.5.5.62
- pandas==1.4.0
- pillow==9.0.1
- pytz==2021.3
- pyyaml==6.0
- scikit-learn==1.0.2
- tensorboard==2.8.0
- tensorboard-data-server==0.6.1
- tensorflow==2.8.0
- tensorflow-gpu==2.8.0
- tensorflow-io-gcs-filesystem==0.23.1
- tf-estimator-nightly==2.8.0.dev2021122109
- threadpoolctl==3.0.0
prefix: /home/nikhilanand_1921cs24/anaconda3/envs/keras-gpu
Create the environment by running conda env create -f environment.yml
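Once the environment is created and activated (conda activate keras-gpu), a small check along these lines should confirm that the GPU is visible and that ops can be placed on it:
import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # should list at least one GPU

with tf.device('/GPU:0'):                      # place a small computation on the first GPU
    a = tf.random.normal([1000, 1000])
    b = tf.random.normal([1000, 1000])
    c = tf.matmul(a, b)
print(c.device)                                # should end in .../GPU:0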

Google compute engine cannot select 1 NVIDIA Tesla K80

I am trying to create a preemptible VM on Google Compute Engine. For some reason I cannot select 1 NVIDIA Tesla K80 GPU; it is simply grayed out. I can select 1 NVIDIA Tesla P100 GPU.
I can select 2 NVIDIA Tesla K80 GPUs, but then I get the error: "Quota 'PREEMPTIBLE_NVIDIA_K80_GPUS' exceeded. Limit: 1.0 in region us-central1."
I don't want to increase the quota to 2 GPUs, since I would have to deposit more money.
Previously, I was able to select 2 NVIDIA Tesla K80 GPUs and launch the instance successfully, but something changed in the last 2 months or so and now it is not working.

Tensorflow CUDA fails with error "failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED"

Here is some of my console output. I am unsure what the actual problem is. When this is displayed I get a Windows prompt stating that Python.exe has stopped working, with the cause being ucrtbase.dll, but I've tried updating that and it still happens, so I think it is a symptom of the real problem rather than the cause. I am also notified by a taskbar message that my NVIDIA kernel driver crashed but recovered.
2017-11-04 17:48:17.363024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:17.375024: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:955
Found device 0 with properties:
name: Quadro K1100M
major: 3 minor: 0 memoryClockRate (GHz) 0.7055
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.93GiB
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0
2017-11-04 17:48:19.995174: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0: Y
2017-11-04 17:48:20.018175: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045]
Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro K1100M, pci bus id: 0000:01:00.0)
2017-11-04 17:49:35.796510: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.93GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_driver.cc:1068] failed to synchronize the stop event: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:54] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_timer.cc:59] Internal: error destroying CUDA event in context 0000000026CFBE70: CUDA_ERROR_UNKNOWN
2017-11-04 17:49:41.811854: F C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:2045] failed to enqueue convolution on stream: CUDNN_STATUS_EXECUTION_FAILED
If you're still looking for the answer, try reducing the batch size. I'm not entirely sure what is happening with the error (there's no explanation on GitHub either), but reducing the batch size helped me.
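For reference, a minimal sketch of the idea with the TF 1.x API shown in the log (the batch size here is just a hypothetical starting point, and allow_growth is an extra knob that stops TensorFlow from grabbing all 2 GiB up front):
import tensorflow as tf

batch_size = 8  # halve this until the CUDNN_STATUS_EXECUTION_FAILED error stops

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand

sess = tf.Session(config=config)
# build the model and feed batches of size batch_size to sess.run(...)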

Tensorflow not using GPU for one dataset, where it does for a very similar dataset

I'm using TensorFlow to train a model with data originating from two sources. For both sources the training and validation data shapes are almost identical, and the dtypes throughout are np.float32.
The strange thing is that when I use the first data set the GPU on my machine is used, but when using the second data set it is not.
Does anyone have some suggestions on how to investigate?
print(s1_train_data.shape)
print(s1_train_data.values)
(1165032, 941)
[[ 0.45031181 -0.99680316 0.63686389 ..., 0.22323072 -0.37929842 0. ]
[-0.40660214 0.34022757 -0.00710014 ..., -1.43051076 -0.14785887 1. ]
[ 0.03955967 -0.91227823 0.37887612 ..., 0.16451506 -1.02560401 0. ]
...,
[ 0.11746094 -0.18229018 0.43319091 ..., 0.36532226 -0.48208624 0. ]
[ 0.110379 -1.07364404 0.42837444 ..., 0.74732345 0.92880726 0. ]
[-0.81027234 -1.04290771 -0.56407243 ..., 0.25084609 -0.1797282 1. ]]
print(s2_train_data.shape)
print(s2_train_data.values)
(559873, 941)
[[ 0. 0. 0. ..., -1.02008295 0.27371082 0. ]
[ 0. 0. 0. ..., -0.74775815 0.18743835 0. ]
[ 0. 0. 0. ..., 0.6469788 0.67864949 1. ]
...,
[ 0. 0. 0. ..., -0.88198501 -0.02421325 1. ]
[ 0. 0. 0. ..., 0.28361112 -1.08478808 1. ]
[ 0. 0. 0. ..., 0.22360609 0.50698668 0. ]]
Edit: here's a snippet of the log with log_device_placement=True.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7578380
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:04.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x7c54b10
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:05.0
Total memory: 4.00GiB
Free memory: 3.95GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0x65bb1d0
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:06.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 0 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 1 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 2 and 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 1
I tensorflow/core/common_runtime/gpu/gpu_device.cc:777] Peer access not supported between device ordinals 3 and 2
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y N N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: N Y N N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: N N Y N
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: N N N Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GRID K520, pci bus id: 0000:00:04.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GRID K520, pci bus id: 0000:00:05.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GRID K520, pci bus id: 0000:00:06.0)
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0
I tensorflow/core/common_runtime/direct_session.cc:255] Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -> device: 0, name: GRID K520, pci bus id: 0000:00:03.0
/job:localhost/replica:0/task:0/gpu:1 -> device: 1, name: GRID K520, pci bus id: 0000:00:04.0
/job:localhost/replica:0/task:0/gpu:2 -> device: 2, name: GRID K520, pci bus id: 0000:00:05.0
/job:localhost/replica:0/task:0/gpu:3 -> device: 3, name: GRID K520, pci bus id: 0000:00:06.0
WARNING:tensorflow:From tf.py:183 in get_session.: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use `tf.global_variables_initializer` instead.
gradients_3/add_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/add_2_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/add_2_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/Tile_grad/range: (Range)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/Mean_1_grad/truediv_grad/Shape_1: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const): /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:821] gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/Size: (Const)/job:localhost/replica:0/task:0/gpu:0
gradients_3/gradients_2/logistic_loss_1_grad/Sum_grad/range: (Range): /job:localhost/replica:0/task:0/gpu:0
It does seem to be placing the ops on the GPU; however, I still see almost entirely 0% GPU-Util in the nvidia-smi monitor.
The pandas DataFrame is of course in memory. Is there any other I/O that could be impacting this process?
Edit 2: I captured the log_device_placement logs for both the fast and slow data sets. They are identical, even though in one case the GPU usage is 25% and in the other 0%. Really scratching my head now....
The cause of the slowness was the memory layout of the ndarray backing the DataFrame. The s2 data was column-major, meaning that each row of features and targets was not contiguous in memory.
This operation changes the memory layout:
s2_train_data = s2_train_data.values.copy(order='C')
and now the GPU is running at 26% utilisation. Happy days :)
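A small self-contained way to check for the same issue on any array; np.ascontiguousarray has the same effect as the .copy(order='C') above:
import numpy as np

# Mimic the s2 case: a column-major (Fortran-order) float32 array
a = np.asfortranarray(np.random.rand(4, 3).astype(np.float32))
print(a.flags['C_CONTIGUOUS'])   # False: rows are not contiguous in memory

b = np.ascontiguousarray(a)      # row-major copy, equivalent to .copy(order='C')
print(b.flags['C_CONTIGUOUS'])   # True: each row is now contiguous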