I've run an image processing script using the TensorFlow API. It turns out that the processing time decreased significantly when I moved the for-loop outside of the session block. Could anyone tell me why? Are there any side effects?
The original code:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(len(file_list)):
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        # image_crop, bboxs_crop, image_debug = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        # Image._show(Image.fromarray(np.asarray(image_crop)))
        # Image._show(Image.fromarray(np.asarray(image_debug)))
        save_image(image_crop, ntpath.basename(file_list[i]))
        # save_desc_file(file_list[i], labels_list[i], bboxs_crop)
        save_desc_file(file_list[i], labels, bboxs)
    coord.request_stop()
    coord.join(threads)
The modified code:
for i in range(len(file_list)):
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)
With the original code, the time per image kept increasing, from 200 ms up to as much as 20000 ms. After the modification, the log messages indicate that more than one graph and more than one TensorFlow device were created while running. Why is that?
python random_crop_images_hongyuan.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GT 730M major: 3 minor: 5 memoryClockRate (GHz) 0.758
pciBusID 0000:01:00.0
Total memory: 982.88MiB
Free memory: 592.44MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3000 th in 317 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3001 th in 325 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3002 th in 312 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3003 th in 147 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3004 th in 447 ms
My guess is that this happens because creating a session is an expensive operation. Maybe the session is also not properly cleaned up when the with-statement is left, so each new allocation on the device has fewer resources available. In short, I would not recommend doing it this way; rather, initialize just one session and try to reuse it.
EDIT:
In answer to your comment: the session is closed automatically as soon as the with-block is exited. I've read in this GitHub issue that the memory on the GPU is only really released when the whole program exits. But I guess that when you allocate a new session after closing the last one, TensorFlow will internally just reuse the previously allocated resources. So, in retrospect, my answer is probably not very insightful. Sorry if I caused confusion.
It's not possible to be 100% certain without seeing all of your code, but I would guess that the crop_image() function is calling various TensorFlow op functions to build a graph.
It is almost never a good idea to build a graph inside a for loop. This answer explains why: some operations (such as the first Session.run() call to a new operation) take time that is linear in the number of operations in the graph. If you add more operations in each iteration, iteration i will do work that is linear in i, and so the overall execution time will be quadratic.
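For illustration, here is a minimal sketch (not part of the question's code; TF 1.x assumed) showing how operations accumulate in the default graph when you create them inside a loop:

import tensorflow as tf

# Each call to an op function such as tf.constant() adds a node to the default
# graph, so the graph keeps growing with every iteration of the loop.
with tf.Graph().as_default() as g:
    for i in range(3):
        tf.constant(i, name='c%d' % i)
        print(len(g.get_operations()))  # prints 1, 2, 3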
The modified version of your code (with a with tf.Graph().as_default(): block inside the loop) will be faster because it creates a new, empty tf.Graph in each iteration, and therefore each iteration does a constant amount of work.
An even more efficient solution would be to build the graph and session once, using tf.placeholder() tensors to represent the filename and bbox arguments to crop_image, and feeding different values to these placeholders in each iteration.
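A minimal sketch of that approach, assuming a hypothetical build_crop_graph() helper that contains the graph-building part of crop_image() and takes a filename placeholder and a bbox placeholder (the bbox shape below is an assumption; file_list, bboxs_list, labels_list, filter_bbox, save_image and save_desc_file are from the question):

import time
import ntpath
import tensorflow as tf

# Build the graph once, outside the loop.
filename_ph = tf.placeholder(tf.string, shape=[])
bboxs_ph = tf.placeholder(tf.float32, shape=[None, 4])  # assumed bbox layout
image_crop_op, bboxs_crop_op = build_crop_graph(filename_ph, bboxs_ph)  # hypothetical helper

with tf.Session() as sess:
    for i in range(len(file_list)):
        start = time.time()
        # Reuse the same ops every iteration; only the fed values change.
        image_crop, bboxs_crop = sess.run(
            [image_crop_op, bboxs_crop_op],
            feed_dict={filename_ph: file_list[i], bboxs_ph: bboxs_list[i]})
        print('Done image %d th in %d ms' % (i, (time.time() - start) * 1000))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)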
Related
I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below and it looks like the script stops running (watching nvidia-smi, both the temperature and the fan activity of the GPUs decrease). I have already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?
I was able to run StyleGAN with the same dataset. The paper says that StyleGAN2 should be less resource-hungry, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The config-f model for StyleGAN2 is actually bigger than StyleGAN1. Try using a less VRAM-consuming configuration such as config-e. You can change the model configuration by passing a flag in your Python command; see: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I'm able to train StyleGAN2 with config-e on two RTX 2080 Ti cards.
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
Your NVIDIA GeForce RTX 2080 cards have 8 GB each (short of the 16 GB quoted above), but I guess you're saying you have eight of them? I don't think TensorFlow is set up for that kind of parallelism out of the box.
python model_main.py --model_dir=training/ --pipeline_config_path=training/faster_rcnn_inception_v2_pets.config
WARNING:tensorflow:Estimator's model_fn (<function create_model_fn.<locals>.model_fn at 0x00000155CFDB5C80>) includes params argument, but params are not passed to Estimator.
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/BoxEncodingPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/biases] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [SecondStageBoxPredictor/ClassPredictor/weights] is available in checkpoint, but has an incompatible shape with model variable.
WARNING:root:Variable [global_step] is not available in checkpoint
D:\ProgramFiles\Anaconda\envs\Eneger\lib\site-packages\tensorflow\python\ops\gradients_impl.py:100: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2018-07-24 05:19:44.209751: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2018-07-24 05:19:45.534609: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1356] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 4.97GiB
2018-07-24 05:19:45.571588: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1435] Adding visible gpu devices: 0
2018-07-24 05:19:53.604693: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-07-24 05:19:53.615439: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:929] 0
2018-07-24 05:19:53.622183: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:942] 0: N
2018-07-24 05:19:53.643331: I C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 4730 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-07-24 05:20:58.481169: E C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\stream_executor\cuda\cuda_event.cc:49] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_INSTRUCTION
2018-07-24 05:20:58.511921: F C:\users\nwani\_bazel_nwani\mmtm6wb6\execroot\org_tensorflow\tensorflow\core\common_runtime\gpu\gpu_event_mgr.cc:208] Unexpected Event status: 1
There has been an open issue about this on the TensorFlow GitHub since March:
https://github.com/tensorflow/tensorflow/issues/17747
It seems that the error appears randomly rather than every time; can you confirm that? The last answer was 10 days ago, so I think they're still working on it.
I would first check that your CUDA and cuDNN versions are the ones compatible with your TensorFlow version, for example CUDA toolkit 9.0 rather than 9.1, and cuDNN 7.0 rather than 7.0.5.
After that I would update the GPU drivers from the official NVIDIA website.
I would not only check that you have CUDA 9.0 and cuDNN 7.0, but also make sure that the cuDNN build you selected and installed is the one provided for CUDA 9.0 (there are cuDNN builds for CUDA 8.0, 9.0, and 9.1).
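If you want a quick sanity check of what your installed TensorFlow binary actually supports, a minimal sketch (assuming TensorFlow 1.x is importable) is:

import tensorflow as tf

print(tf.__version__)                # installed TensorFlow version
print(tf.test.is_built_with_cuda())  # True if this binary was built against CUDA
print(tf.test.gpu_device_name())     # non-empty, e.g. '/device:GPU:0', only if CUDA and cuDNN load correctly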
I am running TensorFlow 1.7 on my local machine, which contains two GPUs with around 8 GB of memory each.
Training the object detection model (train.py) works when I use the model 'faster_rcnn_resnet101_coco'. But when I try to run 'faster_rcnn_nas_coco', it shows a 'Resource exhausted' error:
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N Y
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: Y N
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
Limit: 7744048333
InUse: 7699536896
MaxInUse: 7699551744
NumAllocs: 10260
MaxAllocSize: 4076716032
2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I am not sure whether it is using both GPUs, because the usage shows '7699536896' bytes of memory in use. After going through train.py, I also tried:
python train.py \
--logtostderr \
--worker_replicas=2 \
--pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
--train_dir=training
If two GPUs are available, does TensorFlow choose both of them by default, or does it need any arguments?
The number of GPUs used is specified by worker_replicas. For the NASNet case, try decreasing the batch size to make the network fit on the GPU.
I am trying to train my model on a GPU instead of a CPU on an AWS p2.xlarge instance from my Jupyter notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu was installed and mentioned in requirements.txt, not tensorflow).
I am not seeing any speed improvement when training models on these instances compared to using a CPU; in fact, the per-epoch training speed is almost the same as on my 4-core laptop CPU (the p2.xlarge also has 4 vCPUs alongside a Tesla K80 GPU). I am not sure whether I need to change my code to accommodate the faster/parallel processing that the GPU can offer. I am pasting my model code below:
from keras.models import Sequential
from keras.layers import core, recurrent  # imports assumed; the snippet uses the keras.layers.core and keras.layers.recurrent submodules

model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)
Also, interestingly, the GPU seems to be using between 50% and 60% of its processing power and almost all of its memory every time I check the GPU status with nvidia-smi (but both fall to 0% and 1 MiB respectively when not training):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 47C P0 73W / 149W | 10919MiB / 11439MiB | 52% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1665 C ...ubuntu/aDash/MLenv/bin/python 10906MiB |
+-----------------------------------------------------------------------------+
Also if you'd like to see my logs about using the GPU from Jupyter Notebook:
[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Please suggest what could be the problem. Thanks a ton for looking at this anyways!
That happens because you're using LSTM layers.
TensorFlow's implementation of LSTM layers is not that great. The reason is probably that recurrent calculations are not parallel calculations, and GPUs are great for parallel processing.
I confirmed that from my own experience:
Got terrible speed using LSTMs in my model
Decided to test the model removing all LSTMs (got a pure convolutional model)
The resulting speed was simply astonishing!!!
This article about using GPUs and tensorflow also confirms that:
http://minimaxir.com/2017/07/cpu-or-gpu/
A possible solution?
You may try using the new CuDNNLSTM layer, which seems to be built specifically for GPUs.
I never tested it, but you'll most probably get a better performance with this.
Another thing that I haven't tested, and I'm not sure it's designed for this purpose, but I suspect it is: you can pass unroll=True to your LSTM layers. With that, I suspect the recurrent calculations will be transformed into parallel ones.
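A minimal sketch of the question's model with the LSTM layers swapped for CuDNNLSTM (assuming Keras >= 2.0.9 with the TensorFlow backend, a CUDA-capable GPU, and the same X_np/y_np arrays as in the question):

from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, Dense

model = Sequential()
# CuDNNLSTM uses the cuDNN kernel; it takes no activation or recurrent_dropout arguments
model.add(CuDNNLSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]), return_sequences=True))
model.add(CuDNNLSTM(64, return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)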
Try using a bigger value for batch_size in model.fit, because the default is 32. Increase it until you get 100% GPU utilization.
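For example, a one-line sketch reusing the model and arrays from the question (the particular value is just a starting point):

# raise batch_size above the default of 32 and watch GPU utilization in nvidia-smi
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=256)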
Following the suggestion from @dgumo, you can also put your data into /run/shm. This is an in-memory file system, which allows data to be accessed as fast as possible. Alternatively, you can at least make sure your data resides on an SSD, for example in /tmp.
The bottleneck in your case is transferring data to and from the GPU. The best way to speed up your computation (and maximize your GPU usage) is to load as much of your data as your memory can hold at once. Since you have plenty of memory, you can load all your data in one go by doing:
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=X_np.shape[0])
(You should also probably increase the number of epochs when doing this).
Note however that there are advantages to minibatching (e.g. better handling of local minima), so you should probably consider choosing a batch_size somewhere in between.
I'm going to train a seq2seq model using the tf-seq2seq package on a 1080 Ti (11 GB) GPU. I always get the following error with different network sizes (even nmt_small):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 10.91G (11715084288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12337 get requests, put_count=10124 evicted_count=1000 eviction_rate=0.0987752 and unsatisfied allocation rate=0.268542
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ../model/model.ckpt.
INFO:tensorflow:step = 1, loss = 5.07399
It seems that TensorFlow tries to occupy the total amount of the GPU's memory (10.91 GiB), but only 10.75 GiB is available.
You should note a couple of things:
1. Use memory growth. From the TensorFlow documentation: "in some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this."
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
2. Are you training in batches, or feeding the whole dataset at once? If you are using batches, try decreasing the batch size.
In addition to both of the suggestions made concerning the memory growth, you can also try:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90
with tf.Session(config=sess_config) as sess:
...
With this you can limit the amount of GPU memory allocated by the program, in this case to 90 percent of the available GPU memory. Maybe this is sufficient to solve your problem of the network trying to allocate more memory than is available.
If this is not sufficient, you will have to decrease the batch size or the size of the network.