How to debug local variables in TensorFlow - tensorflow

I'd like to print the value of a tensor in TensorFlow, but it failed. How can I correct it?
train_vector1, train_vector2, train_vector3, train_vector4, train_vector5,train_labels = decode_records(FLAGS.record_train, FLAGS.epoch, record_params)
sess = tf.Session(config=session_conf)
print(sess.run(train_labels))
When I run tf.py, the process hangs (log below). Why?
2018-06-15 16:52:53.782143: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-06-15 16:52:54.111552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1212] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:0d:00.0
totalMemory: 15.89GiB freeMemory: 15.60GiB
2018-06-15 16:52:54.111607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1312] Adding visible gpu devices: 0
2018-06-15 16:52:54.408837: I tensorflow/core/common_runtime/gpu/gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15128 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:0d:00.0, compute capability: 6.0)
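If decode_records builds a TF1 queue-based input pipeline (an assumption; its code isn't shown above), sess.run(train_labels) will block until the queue runners that fill the queue are started. A minimal sketch of the usual pattern, reusing session_conf and train_labels from the question:
import tensorflow as tf

# train_vector1, ..., train_labels = decode_records(FLAGS.record_train, FLAGS.epoch, record_params)
with tf.Session(config=session_conf) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # needed when the reader uses num_epochs
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(train_labels))  # the queue is now being filled, so this returns
    coord.request_stop()
    coord.join(threads)
If decode_records instead returns tensors from a tf.data iterator, the equivalent step is running the iterator's initializer before the first sess.run.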

Related

How to use a single Gpu from multi Gpu system in tensorflow

I have a multi-GPU system and tried to run the model on GPU 6. I tried different things to run it on GPU id 6, but I am not able to do that. How can I make GPU 6 available to the notebook? I have also attached the code below.
import os
import tensorflow as tf
# Make only the physical GPU with id 6 visible to TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "6"
print(tf.test.gpu_device_name())
from tensorflow.python.client import device_lib
devices_tf = device_lib.list_local_devices()
print(devices_tf)
Below is the output which I got from the above code:
/device:GPU:0
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15691930590178259318
xla_global_id: -1
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 219742208
locality {
bus_id: 1
links {
}
}
incarnation: 10143841604030116090
physical_device_desc: "device: 0, name: GeForce RTX 3090, pci bus id:
0000:23:00.0, compute capability: 8.6"
xla_global_id: 416903419
]
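The output above is actually expected: once CUDA_VISIBLE_DEVICES="6" takes effect, the single remaining visible card is re-enumerated inside the process as /device:GPU:0, so seeing GPU:0 does not mean host GPU 0 is being used. A small sketch (assuming TF 2.x, as the xla_global_id fields in the output suggest) to confirm which physical card was actually selected:
import os
# Must take effect before TensorFlow initializes the GPUs - safest is before the import.
os.environ["CUDA_VISIBLE_DEVICES"] = "6"

import tensorflow as tf
from tensorflow.python.client import device_lib

print(tf.config.list_physical_devices("GPU"))  # one device, reported as GPU:0

# The physical_device_desc (card name and PCI bus id) identifies the real card.
for d in device_lib.list_local_devices():
    if d.device_type == "GPU":
        print(d.physical_device_desc)
Comparing the printed PCI bus id against nvidia-smi confirms whether the card with host index 6 is the one in use.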

Tensorflow 1.15 multi worker strategy hangs after graph initialization on multiple machines

I am running the TF keras_to_estimator example on two machines; the process hangs after graph initialization when I run the start script on each machine.
The console output on the worker 0 machine after starting:
INFO:tensorflow:Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['node4:21111', 'node3:21112']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ('/job:worker/task:0',), communication = CollectiveCommunication.AUTO
I0605 17:05:20.218733 139934274328320 collective_all_reduce_strategy.py:310] Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['node4:21111', 'node3:21112']}, task_type = 'worker', task_id = 0, num_workers = 2, local_devices = ('/job:worker/task:0',
), communication = CollectiveCommunication.AUTO
INFO:tensorflow:Updated config: {'_model_dir': '/node4/jianwang/atp_bert/albert_zh/example_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f44402717b8>, '_device_fn': None, '_protocol': None, '_eval_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f44402755c0>, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.Collectiv
eAllReduceStrategy object at 0x7f4440275240>, eval_distribute=<tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f44402755c0>, remote_cluster=None), '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4440275940>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://node4:21111', '_evaluation_master': 'grpc://node4:21111', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'independent_worker'}
I0605 17:05:20.221589 139934274328320 estimator_training.py:228] Updated config: {'_model_dir': '/node4/jianwang/atp_bert/albert_zh/example_dir', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session
_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': <tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f44402717b8>, '_device_fn': None, '_proto
col': None, '_eval_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f44402755c0>, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.Collectiv
eAllReduceStrategy object at 0x7f4440275240>, eval_distribute=<tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f44402755c0>, remote_cluster=None), '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_s
ervice': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f4440275940>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': 'grpc://node4:21111', '_evaluation_master': 'grpc://node4:21111', '_is_chief': True,
'_num_ps_replicas': 0, '_num_worker_replicas': 2, '_distribute_coordinator_mode': 'independent_worker'}
input_fn called
INFO:tensorflow:Calling model_fn.
I0605 17:05:20.358438 139911839606528 estimator.py:1148] Calling model_fn.
...
INFO:tensorflow:Creating chief session creator with config: device_filters: "/job:worker/task:0"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
scoped_allocator_optimization: ON
scoped_allocator_opts {
enable_op: "CollectiveReduce"
}
}
}
experimental {
collective_group_leader: "/job:worker/replica:0/task:0"
}
I0605 17:05:20.711247 139934274328320 distribute_coordinator.py:251] Creating chief session creator with config: device_filters: "/job:worker/task:0"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
scoped_allocator_optimization: ON
scoped_allocator_opts {
enable_op: "CollectiveReduce"
}
}
}
experimental {
collective_group_leader: "/job:worker/replica:0/task:0"
}
INFO:tensorflow:Graph was finalized.
I0605 17:05:20.870544 139934274328320 monitored_session.py:240] Graph was finalized.
The same messages are also printed on the worker 1 machine, which likewise shows the process stuck after graph initialization:
I0605 17:10:28.616780 140121708521216 collective_all_reduce_strategy.py:310] Multi-worker CollectiveAllReduceStrategy with cluster_spec = {'worker': ['node4:21111', 'node3:21112']}, task_type = 'worker', task_id = 1, num_workers = 2, local_devices = ('/job:worker/task:1',
), communication = CollectiveCommunication.AUTO
INFO:tensorflow:Updated config: {'_model_dir': '/node4/jianwang/atp_bert/albert_zh/example_dir', '_num_ps_replicas': 0, '_tf_random_seed': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_experimental_max_worker_delay_secs': None, '_eval_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f7085a28128>, '_save_checkpoints_secs': 600, '_keep_checkpoint_every_n_hours': 10000, '_is_chief': False, '_keep_checkp
oint_max': 5, '_device_fn': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f7085a1eef0>, eval_distribute=<tensorflow.contrib.distribute.pytho
n.mirrored_strategy.MirroredStrategy object at 0x7f7085a28128>, remote_cluster=None), '_session_creation_timeout_secs': 7200, '_master': 'grpc://node3:21112', '_service': None, '_task_type': 'worker', '_task_id': 1, '_protocol': None, '_log_step_count_steps': 100, '_distr
ibute_coordinator_mode': 'independent_worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7085a282e8>, '_global_id_in_cluster': 1, '_evaluation_master': 'grpc://node3:21112', '_train_distribute': <tensorflow.contrib.distribute.python
.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f7085a1e8d0>, '_num_worker_replicas': 2, '_save_checkpoints_steps': None, '_save_summary_steps': 100}
I0605 17:10:28.623507 140121708521216 estimator_training.py:228] Updated config: {'_model_dir': '/node4/jianwang/atp_bert/albert_zh/example_dir', '_num_ps_replicas': 0, '_tf_random_seed': None, '_session_config': allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
}
}
, '_experimental_max_worker_delay_secs': None, '_eval_distribute': <tensorflow.contrib.distribute.python.mirrored_strategy.MirroredStrategy object at 0x7f7085a28128>, '_save_checkpoints_secs': 600, '_keep_checkpoint_every_n_hours': 10000, '_is_chief': False, '_keep_checkp
oint_max': 5, '_device_fn': None, '_experimental_distribute': DistributeConfig(train_distribute=<tensorflow.contrib.distribute.python.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f7085a1eef0>, eval_distribute=<tensorflow.contrib.distribute.pytho
n.mirrored_strategy.MirroredStrategy object at 0x7f7085a28128>, remote_cluster=None), '_session_creation_timeout_secs': 7200, '_master': 'grpc://node3:21112', '_service': None, '_task_type': 'worker', '_task_id': 1, '_protocol': None, '_log_step_count_steps': 100, '_distr
ibute_coordinator_mode': 'independent_worker', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f7085a282e8>, '_global_id_in_cluster': 1, '_evaluation_master': 'grpc://node3:21112', '_train_distribute': <tensorflow.contrib.distribute.python
.collective_all_reduce_strategy.CollectiveAllReduceStrategy object at 0x7f7085a1e8d0>, '_num_worker_replicas': 2, '_save_checkpoints_steps': None, '_save_summary_steps': 100}
input_fn called
INFO:tensorflow:Calling model_fn.
...
INFO:tensorflow:Creating chief session creator with config: device_filters: "/job:worker/task:1"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
scoped_allocator_optimization: ON
scoped_allocator_opts {
enable_op: "CollectiveReduce"
}
}
}
experimental {
collective_group_leader: "/job:worker/replica:0/task:0"
}
I0605 17:10:29.048442 140121708521216 distribute_coordinator.py:251] Creating chief session creator with config: device_filters: "/job:worker/task:1"
allow_soft_placement: true
graph_options {
rewrite_options {
meta_optimizer_iterations: ONE
scoped_allocator_optimization: ON
scoped_allocator_opts {
enable_op: "CollectiveReduce"
}
}
}
experimental {
collective_group_leader: "/job:worker/replica:0/task:0"
Related code: (1) example.sh (start script run on node4, the worker 0 machine)
export TF_CONFIG='{
"cluster": {
"worker": ["node4:21111", "node3:21112"]
},
"task": {"type": "worker", "index": 0}
}'
export CUDA_VISIBLE_DEVICES=0
export OUTPUT_DIR=/node4/jianwang/atp_bert/albert_zh/example_dir
python example.py $OUTPUT_DIR
(2) example_slave.sh (start script to run on the worker 1 machine)
export TF_CONFIG='{
"cluster": {
"worker": ["node4:21111", "node3:21112"]
},
"task": {"type": "worker", "index": 1}
}'
export CUDA_VISIBLE_DEVICES=7
export OUTPUT_DIR=/node4/jianwang/atp_bert/albert_zh/example_dir
python example.py $OUTPUT_DIR
(3) example.py
"""An example of training Keras model with multi-worker strategies."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import sys
import numpy as np
import tensorflow as tf
def input_fn():
print("input_fn called")
x = np.random.random((1024, 10))
y = np.random.randint(2, size=(1024, 1))
x = tf.cast(x, tf.float32)
dataset = tf.data.Dataset.from_tensor_slices((x, y))
dataset = dataset.repeat(100)
dataset = dataset.batch(32)
return dataset
def main(args):
if len(args) < 2:
print('You must specify model_dir for checkpoints such as'
' /tmp/tfkeras_example/.')
return
model_dir = args[1]
print('Using %s to store checkpoints.' % model_dir)
# Define a Keras Model.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(16, activation='relu', input_shape=(10,)))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
# Compile the model.
optimizer = tf.train.GradientDescentOptimizer(0.2)
model.compile(loss='binary_crossentropy', optimizer=optimizer)
model.summary()
tf.keras.backend.set_learning_phase(True)
# Define DistributionStrategies and convert the Keras Model to an
# Estimator that utilizes these DistributionStrateges.
# Evaluator is a single worker, so using MirroredStrategy.
config = tf.estimator.RunConfig(
experimental_distribute=tf.contrib.distribute.DistributeConfig(
train_distribute=tf.contrib.distribute.CollectiveAllReduceStrategy(
),
eval_distribute=tf.contrib.distribute.MirroredStrategy(
)))
keras_estimator = tf.keras.estimator.model_to_estimator(
keras_model=model, config=config, model_dir=model_dir)
# Train and evaluate the model. Evaluation will be skipped if there is not an
# "evaluator" job in the cluster.
print("Start train eval")
tf.estimator.train_and_evaluate(
keras_estimator,
train_spec=tf.estimator.TrainSpec(input_fn=input_fn),
eval_spec=tf.estimator.EvalSpec(input_fn=input_fn))
if __name__ == '__main__':
tf.logging.set_verbosity(tf.logging.INFO)
tf.app.run(argv=sys.argv)
lspci output:
jianwang@node3:~$ lspci | grep PCI
00:01.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 (rev 01)
00:02.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 (rev 01)
00:03.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 (rev 01)
00:1c.0 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #1 (rev d5)
00:1c.7 PCI bridge: Intel Corporation C610/X99 series chipset PCI Express Root Port #8 (rev d5)
02:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
03:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
06:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
07:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
0b:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
7f:10.0 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent (rev 01)
7f:10.1 Performance counters: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent (rev 01)
80:00.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 0 (rev 01)
80:01.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 1 (rev 01)
80:02.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 2 (rev 01)
80:03.0 PCI bridge: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D PCI Express Root Port 3 (rev 01)
83:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
84:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
87:00.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:08.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
88:10.0 PCI bridge: PLX Technology, Inc. PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch (rev ca)
ff:10.0 System peripheral: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent (rev 01)
ff:10.1 Performance counters: Intel Corporation Xeon E7 v4/Xeon E5 v4/Xeon E3 v4/Xeon D R2PCIe Agent (rev 01)
I have tried adding a "chief" task to TF_CONFIG and disabling the IOMMU (following "disable ioMMU"); neither worked. Please help with:
(1) how to diagnose where it hangs
(2) any insights on how to work around this problem
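For (1), a hang right after "Graph was finalized" with CollectiveAllReduceStrategy is commonly a sign that the two workers cannot reach each other's gRPC ports, so a first diagnostic step (my suggestion, not something from the question) is a plain TCP reachability check from each machine against the addresses in TF_CONFIG; raising TensorFlow's native logging with TF_CPP_MIN_VLOG_LEVEL=1 on both nodes can also show which step each worker is waiting on.
# Hypothetical diagnostic: run on each machine to confirm the worker ports are reachable.
import socket

workers = ["node4:21111", "node3:21112"]  # taken from TF_CONFIG above

for addr in workers:
    host, port = addr.split(":")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(5)
    try:
        sock.connect((host, int(port)))
        print("reachable:", addr)
    except OSError as err:
        print("NOT reachable:", addr, err)
    finally:
        sock.close()
The gRPC servers only listen while the start scripts are running, so the check should be made while both workers are up and hanging.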

How to configure Tensorflow to use a specific GPU?

These are the activated devices that I have:
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 5415837867258701517
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 3198956339
locality {
bus_id: 1
links {
}
}
incarnation: 12462133041849407996
physical_device_desc: "device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0"
]
What I want to do is configure my program to use the GeForce GTX 960M, and also make this configuration permanent for all my previous/future programs, if that is possible.
Try the function tf.config.set_visible_devices, which lets you specify which GPUs TensorFlow is allowed to use:
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices(physical_devices[1:], 'GPU')
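For the setup above there is a single GPU, so the GTX 960M is already GPU:0; a minimal sketch (assuming TF 2.x as in the answer) that pins the program to it and avoids grabbing all of its memory:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Keep only the first (and here, only) GPU visible: the GTX 960M.
    tf.config.set_visible_devices(gpus[0], 'GPU')
    # Optional: allocate GPU memory on demand instead of all at once.
    tf.config.experimental.set_memory_growth(gpus[0], True)

print(tf.config.get_visible_devices('GPU'))
Making the choice "permanent" across programs is usually done outside the script, e.g. by exporting CUDA_VISIBLE_DEVICES=0 in the shell profile, rather than in TensorFlow itself (this is an assumption about the intent, not part of the answer above).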

Tensorflow allocating all memory for any program

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): linux Ubuntu 16.04
TensorFlow installed from (source or binary): binary
TensorFlow version (use command below): v1.4.0-rc1
Python version: 3.5.5
CUDA/cuDNN version: CUDA 8.0 / cuDNN 6
GPU model and memory: nvidia gtx 1080
I am new to Tensorflow. So this could easily be some silly installation error that I don't see.
I open python to test TF installation:
import tensorflow as tf
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
Resulting in:
I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2018-04-11 21:39:44.830140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate(GHz): 1.8475
pciBusID: 0000:01:00.0
totalMemory: 7.92GiB freeMemory: 78.94MiB
2018-04-11 21:39:44.830178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-04-11 21:39:44.832231: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 78.94M (82771968 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.834394: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 71.04M (74494976 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.835825: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 63.94M (67045632 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.837560: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 57.55M (60341248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.839233: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 51.79M (54307328 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.841757: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 46.61M (48876800 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.843632: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 41.95M (43989248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.845588: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 37.76M (39590400 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.847229: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 33.98M (35631360 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.849278: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 30.58M (32068352 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2018-04-11 21:39:44.850967: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 27.52M (28861696 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6037705122138393497
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 82771968
locality {
bus_id: 1
}
incarnation: 11403601020071115295
physical_device_desc: "device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1"
]
Supposing your question is "Why does Tensorflow allocate all available GPU memory even though much less memory would be enough for my program?", then the answer is that they do this to reduce GPU memory fragmentation. You can change this default behavior with some settings like config.gpu_options.allow_growth and config.gpu_options.per_process_gpu_memory_fraction to make Tensorflow less memory hungry at the expense of allowing some potential memory fragmentation to occur. Detailed explanation in the Tensorflow Programmer's Guide Using GPU chapter.
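A minimal sketch of those two settings with the TF 1.x API used in the question (the numeric fraction is an illustrative value, not something from the answer):
import tensorflow as tf

config = tf.ConfigProto()
# Start small and grow the GPU allocation on demand instead of grabbing all free memory.
config.gpu_options.allow_growth = True
# Alternative: cap the process at a fixed share of the GPU's memory (illustrative value).
# config.gpu_options.per_process_gpu_memory_fraction = 0.4

sess = tf.Session(config=config)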

Keras and TensorFlow: What does "Peer access not supported between device ordinals 0 and 1" mean and how do I fix it?

I have 2 GPUs installed, and when I train a model I get the following messages. What do "Peer access not supported between device ordinals 0 and 1" and "Peer access not supported between device ordinals 1 and 0" mean? Is it an error, is it something I have to fix? I mean, the model itself trains successfully in the end. I think it uses only one of the GPUs, not both. But I want to understand this message and fix the problem. Is there something I need to do?
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:135] successfully opened CUDA library cublas64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:135] successfully opened CUDA library cudnn64_5.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:135] successfully opened CUDA library cufft64_80.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:135] successfully opened CUDA library nvcuda.dll locally
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\dso_loader.cc:135] successfully opened CUDA library curand64_80.dll locally
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "BestSplits" device_type: "CPU"') for unknown op: BestSplits
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "CountExtremelyRandomStats" device_type: "CPU"') for unknown op: CountExtremelyRandomStats
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "FinishedNodes" device_type: "CPU"') for unknown op: FinishedNodes
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "GrowTree" device_type: "CPU"') for unknown op: GrowTree
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "ReinterpretStringToFloat" device_type: "CPU"') for unknown op: ReinterpretStringToFloat
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "SampleInputs" device_type: "CPU"') for unknown op: SampleInputs
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "ScatterAddNdim" device_type: "CPU"') for unknown op: ScatterAddNdim
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "TopNInsert" device_type: "CPU"') for unknown op: TopNInsert
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "TopNRemove" device_type: "CPU"') for unknown op: TopNRemove
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "TreePredictions" device_type: "CPU"') for unknown op: TreePredictions
E c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:943] OpKernel ('op: "UpdateFertileSlots" device_type: "CPU"') for unknown op: UpdateFertileSlots
Using TensorFlow backend.
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.31GiB
W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\stream_executor\cuda\cuda_driver.cc:590] creating context when one is currently active; existing: 0000022BB5DD0500
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:885] Found device 1 with properties:
name: GeForce GTX 970
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:02:00.0
Total memory: 4.00GiB
Free memory: 3.31GiB
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:777] Peer access not supported between device ordinals 0 and 1
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:777] Peer access not supported between device ordinals 1 and 0
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:906] DMA: 0 1
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 0: Y N
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:916] 1: N Y
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970, pci bus id: 0000:01:00.0)
I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\gpu\gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 970, pci bus id: 0000:02:00.0)
This just means that the GPUs cannot communicate directly (pass data from GPU 0 to GPU 1 or vice versa) without passing the data back to the CPU first. The lines are informational ("I" prefix), not errors.
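If the goal is simply to use one card and avoid any cross-GPU traffic (which matches the observation that only one GPU is used anyway), restricting visibility before TensorFlow starts is the usual approach; a minimal sketch, where the script name is hypothetical:
# run as: CUDA_VISIBLE_DEVICES=0 python train.py   (train.py is a hypothetical script name)
import os
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # expose only the first GTX 970

import tensorflow as tf
# With a single visible GPU there is no device pair to check,
# so the "Peer access not supported" lines no longer appear.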