GPU->CPU Memcpy failed in TensorFlow word2vec on GPU - tensorflow

I am studying TensorFlow's word2vec.
We bought two GTX 1080 Ti cards for GPU parallel processing.
Installation was successful, and P2P (peer-to-peer) access between the cards works.
However, when I try to pin the computation to one GPU with tf.device('/gpu:0') (roughly as sketched below), it fails.
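The device pinning looks roughly like this (a minimal sketch; the vocabulary size is taken from the log below, and the variable names are illustrative, not the actual word2vec code):

import tensorflow as tf

with tf.device('/gpu:0'):
    # 11769 unique words (see the data-file log line below), 128-dim embeddings
    embeddings = tf.Variable(tf.random_uniform([11769, 128], -1.0, 1.0))
    # ... rest of the word2vec graph is built here ...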
The following error occurs:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 1080 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:66:00.0
Total memory: 10.91GiB
Free memory: 10.21GiB
tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1
tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y
tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:66:00.0)
I word2vec_kernels.cc:246] Data file: data/spouse_freebase/input2.nt contains 34966827 bytes, 2620786 words, 11769 unique words, 11769 unique frequent words.
E tensorflow/stream_executor/cuda/cuda_driver.cc:1276] failed to enqueue async memcpy from device to host: CUDA_ERROR_INVALID_VALUE; host dst: 0x104d5000000; GPU src: 0x7f12c800cbc0; size: 8=0x8
I tensorflow/stream_executor/stream.cc:1338] stream 0x39c2160 did not wait for stream: 0x39bf9a0
I tensorflow/stream_executor/stream.cc:3775] stream 0x39c2160 did not memcpy device-to-host; source: 0x3bd0d00
F tensorflow/core/common_runtime/gpu/gpu_util.cc:296] GPU->CPU Memcpy failed
I think this error means the GPU is running out of memory.
Any help would be appreciated.
Thank you.

I had the same issue. I've just switched off G-SYNC support in Nvidia settings and it helped.

Related

(tensorflow) Am I using two gpus in parallel correctly?

(I'm sorry if this question is too novice, but as I don't quite understand and want to double-check whether I am using two GPUs in parallel in a correct manner, I ask you the following question.)
Two GPUs (of the same model) are installed in the PC I am using. In one PyCharm project, I run TensorFlow code that sets
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
which then starts with the following initial run log:
Using TensorFlow backend.
2018-09-15 03:36:36.727152: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 03:36:37.080157: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:17:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 03:36:37.080671: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 03:36:37.796088: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 03:36:37.796320: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2018-09-15 03:36:37.796469: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2018-09-15 03:36:37.796723: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
Then in another PyCharm project, I run TensorFlow code that sets
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
which then shows this run log:
Using TensorFlow backend.
2018-09-15 03:37:00.119630: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 03:37:00.468546: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:65:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 03:37:00.468930: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 03:37:01.199726: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 03:37:01.199950: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2018-09-15 03:37:01.200096: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2018-09-15 03:37:01.200349: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
What worries me is that both of them report device 0. But their pciBusIDs are different.
So my simple question is: am I using the two GPUs in parallel in a correct manner?
As I am using Windows 10, I monitored GPU usage with Device Manager, and it seems correct to me. But I just want to hear it from experts.
And if it is okay for you to answer: what is a PCI bus ID, roughly? And why do both logs show device 0?
No, you have to add
with tf.device("your_device_name")
Just follow the "Using Multiple GPUs" section of this tutorial: https://www.tensorflow.org/guide/using_gpu
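For illustration, a minimal sketch along the lines of that section of the guide (TF 1.x graph mode; the matrices are toy values):

import tensorflow as tf

# build one matmul on each GPU explicitly
c = []
for d in ['/device:GPU:0', '/device:GPU:1']:
    with tf.device(d):
        a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
        b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
        c.append(tf.matmul(a, b))

# combine the partial results on the CPU
with tf.device('/cpu:0'):
    total = tf.add_n(c)

# log_device_placement prints which device runs each op
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(total))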

Which GPU is used when 2 GPUs are available but no specific selection is made?

I have two GPUs installed in my PC, to be used in parallel (without SLI or the like). Suppose I run a simple TensorFlow program, like the linear regression as in this. Then which GPU is used? Are both of them used? Here is the run log.
2018-09-15 02:55:36.314345: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 02:55:36.675657: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:17:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 02:55:36.798520: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:65:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 02:55:36.799044: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-09-15 02:55:38.234984: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 02:55:38.235236: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0 1
2018-09-15 02:55:38.235392: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N N
2018-09-15 02:55:38.235559: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 1: N N
2018-09-15 02:55:38.235849: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-09-15 02:55:38.601267: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8783 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
TensorFlow's default is to consume the memory of all visible GPUs, but unless you write code to use multiple GPUs, only the first of the two will be used for computation.
You would typically set the environment variable export CUDA_VISIBLE_DEVICES=0 prior to running Python to limit TensorFlow to seeing only gpu0, for example (0=gpu0, 1=gpu1, etc.; -1=CPU only).
Using both GPUs for computation requires that you write code for multiple GPUs (and make decisions about what that means in your model). There are many tutorials on the topic; here's a quick one I pulled up: http://blog.s-schoener.com/2017-12-15-parallel-tensorflow-intro/
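As a small illustration of the masking approach (the variable must be set before TensorFlow initializes CUDA, so do it before the import; the '0' is just an example):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only gpu0; "-1" hides all GPUs

import tensorflow as tf  # imported after the mask is set
# TensorFlow now sees a single GPU, and it appears as /device:GPU:0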

Multiple GPUs with Keras: weird speedup

I implemented code similar to the multi-GPU code from Keras
(multiGPU tutorial). When running it on a server with 2 GPUs, I get the following training times per epoch:
showing Keras only one GPU and setting gpus = 1 (only use one GPU): one epoch = 32 s
showing Keras two GPUs, and gpus = 1: one epoch = 31 s
showing Keras two GPUs, and gpus = 2: one epoch = 37 s
The output also looks a bit strange: while initializing, the code seems to create multiple TensorFlow devices per GPU. I'm not sure whether this is the correct behavior, but most other examples I have seen had just one such line per GPU.
first test (one GPU shown, gpus = 1):
2017-12-04 14:54:04.071549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 15.93GiB
Free memory: 15.64GiB
2017-12-04 14:54:04.071597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-12-04 14:54:04.071605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-12-04 14:54:04.071619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:54:21.531654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
second test (2 GPUs shown, gpus = 1):
2017-12-04 14:48:24.881733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:48:24.882924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:24.882931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:48:42.353807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:42.353851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
and, weirdly, for the third test (gpus = 2):
2017-12-04 14:41:35.906828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:41:35.907996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:35.908002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:52.944335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:52.944377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:53.709812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:53.709838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
the code:
LSTM = keras.layers.CuDNNLSTM
model.add(LSTM(knots, input_shape=(timesteps, X_train.shape[-1]), return_sequences=True))
model.add(LSTM(knots))
model.add(Dense(3, activation='softmax'))
if gpus >= 2:
    model_basic = model
    with tf.device("/cpu:0"):
        # note: a plain re-binding like this is not affected by the device scope
        model = model_basic
    parallel_model = multi_gpu_model(model, gpus=gpus)
    model = parallel_model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
hist = model.fit(myParameter)
Is this typical behavior? What is wrong with my code such that multiple devices are created per GPU? Thanks in advance.
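For comparison, the usual multi_gpu_model pattern builds the template model inside the /cpu:0 scope rather than re-binding it afterwards. A minimal sketch (layer sizes and input shape are placeholders, not the values from the question):

import tensorflow as tf
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import multi_gpu_model

gpus = 2
with tf.device('/cpu:0'):
    # the template model's weights live on the CPU; one replica is placed per GPU
    model = Sequential()
    model.add(keras.layers.CuDNNLSTM(64, input_shape=(100, 8), return_sequences=True))
    model.add(keras.layers.CuDNNLSTM(64))
    model.add(Dense(3, activation='softmax'))

parallel_model = multi_gpu_model(model, gpus=gpus)
parallel_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])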
I tried the exact code of the multiGPU tutorial.
It looks like this is, somehow, the expected output. But to see the expected speed differences I had to increase the number of samples (to 20000) and the height and width to 100 (due to RAM limits).
I'm not completely sure why I didn't see a speedup with two GPUs in my case. I suspect it is due to memory-speed limits: my batch size is rather small and each sample is also small, so managing the data takes more time than the actual computation.
Distributing the data becomes even more time-consuming with 2 GPUs, while the actual runtime on each GPU decreases.
This could be confirmed by checking the utilization of the graphics cards; sadly, I don't know how to do this.
If anyone has other ideas on this, let me know. Thanks.
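For what it's worth, nvidia-smi can report per-GPU utilization and memory use; a small polling sketch (a suggestion on my part, not something from the original thread):

import subprocess
import time

# print utilization and memory use of every GPU, once per second, ten times
for _ in range(10):
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader"])
    print(out.decode().strip())
    time.sleep(1)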

TensorFlow Nvidia 1070 GPU memory allocation errors: how to troubleshoot?

Long shot:
Ubuntu 16.04, Nvidia 1070 with 8 GB on board. The machine has 64 GB of RAM, the dataset is 1 million records, and I am on current CUDA and cuDNN libraries with TensorFlow 1.0 and Python 3.6.
I am not sure how to troubleshoot this.
I have been working on getting some models up with TensorFlow and have run into this phenomenon a number of times. I don't know of anything other than TensorFlow that would be using the GPU memory.
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cu
Following that, I get the output below, which indicates that some memory allocation is happening, yet it still fails:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 899200000 totalling 4.19GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1649756928 totalling 1.54GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.40GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 8499298304
InUse: 6875780608
MaxInUse: 6878976000
NumAllocs: 338
MaxAllocSize: 1649756928
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******************************************************************************************xxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 6.10MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Internal: Dst tensor is not initialized.
[[Node: linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice/_1055 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1643_linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Update: I reduced the record count from millions to 40,000 and got a base model to run to completion. I still get an error message, but not continuous ones. I get a bunch of text in the model output suggesting restructuring the model, and I suspect the data structure is a big part of the problem. I could still use some better hints on how to debug the entire process. Below is the remaining console output:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.52GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
[I 09:13:09.297 NotebookApp] Saving file at /Documents/InfluenceH/Working_copies/Cond_fcast_wkg/TensorFlow+DNNLinearCombinedClassifier+for+Influence.ipynb
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
I think the problem is that TensorFlow tries to allocate 7.92GB of GPU memory, while only 7.56GB are actually free. I cannot tell you for what reason the rest of the GPU memory is occupied, but you might avoid this problem by limiting the fraction of the GPU memory your program is allowed to allocate:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9
with tf.Session(config=sess_config, ...) as ...:
With this, the program will allocate only 90 percent of the GPU memory, i.e. 7.13GB.
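A related option (standard TF 1.x API, though not mentioned in the answer above) is to let TensorFlow grow its allocation on demand instead of reserving a fixed fraction up front:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # start small and allocate as needed

with tf.Session(config=config) as sess:
    pass  # build and run the model here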

TensorFlow: one set of GPUs on the same machine with the same model works well, another gets an OOM error

I am using multiple GPUs (num_gpus = 4) to train one model with multiple towers. The model trains well on one set of GPUs, CUDA_VISIBLE_DEVICES=0,1,2,3, while it hits an OOM problem during the first graph evaluation with CUDA_VISIBLE_DEVICES=0,1,4,5.
Does anyone have any ideas why this is happening?
The following options are used when creating the session:
session_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False)
session_config.gpu_options.per_process_gpu_memory_fraction = 0.94
session_config.gpu_options.allow_growth = False
The batch size is already super small: 3.
System information
Tensorflow 1.0
Cuda 8.0
Ubuntu 14.04.5 LTS
All GPUs : GeForce GTX 1080
logs:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:07:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcc4593a0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:08:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd2404670
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:18:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd25591b0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:1c:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:18:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:1c:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 47441 get requests, put_count=8461 evicted_count=1000 eviction_rate=0.118189 and unsatisfied allocation rate=0.844839
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
[the same warning repeats for allocations of 3.08GiB (twice), 3.98GiB (twice), 2.54GiB (twice), 3.17GiB, 2.68GiB, and 3.86GiB]
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2698 get requests, put_count=8709 evicted_c
Is the log from a good run, or a bad run? There is no error there, only warnings.
If your system has a dual root complex, 0,1,4,5 could be on different partitions; the DMA matrix would show that. Copies between GPUs on the same root complex are generally faster than copies across them. If a copy has to hold on to a tensor reference for longer because the copy takes longer, you might see increased peak memory usage, which leads to OOM if your model is already close to the limit. Of course, this is just a theory, and without further debugging info it is difficult to tell for sure.
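A quick way to inspect the PCIe topology this answer refers to (a suggestion, not part of the original answer): nvidia-smi can print the GPU interconnect matrix, where "SYS" entries mark GPU pairs whose traffic crosses between root complexes.

import subprocess

# print the GPU interconnect matrix (nvidia-smi includes a legend:
# PIX, PXB, PHB, NODE, SYS, from closest to farthest connection)
print(subprocess.check_output(["nvidia-smi", "topo", "-m"]).decode())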