Keras shows no improvement in training speed with GPU (partial GPU usage?!) - tensorflow

I am trying to train my model on a GPU instead of a CPU on an AWS p2.xlarge instance from my Jupyter Notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu was installed and mentioned in requirements.txt, not tensorflow).
I am not seeing any speed improvement when training models on these instances compared to using a CPU; in fact, I am getting training speeds per epoch that are almost the same as on my 4-core laptop CPU (p2.xlarge also has 4 vCPUs with a Tesla K80 GPU). I am not sure if I need to make some changes to my code to accommodate the faster/parallel processing that the GPU can offer. I am pasting my model code below:
model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)
Also, interestingly, the GPU seems to be using 50-60% of its processing power and almost all of its memory every time I check its status with nvidia-smi (but both fall to 0% and 1MiB respectively when not training):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 47C P0 73W / 149W | 10919MiB / 11439MiB | 52% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1665 C ...ubuntu/aDash/MLenv/bin/python 10906MiB |
+-----------------------------------------------------------------------------+
Also if you'd like to see my logs about using the GPU from Jupyter Notebook:
[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Please suggest what could be the problem. Thanks a ton for looking at this anyway!

That happens because you're using LSTM layers.
TensorFlow's implementation of LSTM layers is not that great. The reason is probably that recurrent calculations are not parallel calculations, and GPUs are great for parallel processing.
I confirmed this from my own experience:
Got terrible speed using LSTMs in my model
Decided to test the model removing all LSTMs (got a pure convolutional model)
The resulting speed was simply astonishing!!!
This article about using GPUs and tensorflow also confirms that:
http://minimaxir.com/2017/07/cpu-or-gpu/
A possible solution?
You may try using the new CuDNNLSTM, which seems to be built specifically for GPUs.
I never tested it, but you'll most probably get better performance with it.
Another thing that I haven't tested, and I'm not sure it's designed for this reason, but I suspect it is: you can set unroll=True in your LSTM layers. With that, I suspect the recurrent calculations will be transformed into parallel ones.
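For illustration, a minimal sketch of the question's model rewritten with CuDNNLSTM (assuming Keras >= 2.0.9 on the tensorflow-gpu backend; X_np and y_np are the arrays from the question, and note that CuDNNLSTM only runs on a GPU and does not support every option of the plain LSTM layer):
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, Dense

model = Sequential()
# CuDNNLSTM uses cuDNN's fused recurrent kernels instead of the generic TF ops
model.add(CuDNNLSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                    return_sequences=True))
model.add(CuDNNLSTM(64, return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)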

Try using a bigger value for batch_size in model.fit, because the default is 32. Increase it until you get close to 100% GPU utilization.
Following the suggestion from @dgumo, you can also put your data into /run/shm. This is an in-memory file system, which allows data to be accessed in the fastest possible way. Alternatively, you can ensure that your data resides at least on an SSD, for example in /tmp.
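For example, a minimal sketch (the value 512 is only an illustrative starting point; keep increasing it while watching GPU utilization and memory in nvidia-smi):
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=512)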

The bottleneck in your case is transferring data to and from the GPU. The best way to speed up your computation (and maximize your GPU usage) is to load as much of your data as your memory can hold at once. Since you have plenty of memory, you can load all your data at once by doing:
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=X_np.shape[0])
(You should also probably increase the number of epochs when doing this).
Note however that there are advantages to minibatching (e.g. better handling of local minima), so you should probably consider choosing a batch_size somewhere in between.

Related

Converting tensorflow session-based code to distribute.MirroredStrategy

I'm pretty new to Tensorflow and I'll be the first to admit I'm a bit confused and turned around and might very well be barking up the wrong tree.
First: This is NOT a question about getting my GPUs working and seen by TensorFlow (TF); I have verified from inside the container that the GPUs are detected by TF. (using tensorflow/tensorflow:1.13.1-gpu-py3)
2020-02-20 22:24:25.233916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-20 22:24:25.233933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 4 5
2020-02-20 22:24:25.233939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y Y Y Y Y
2020-02-20 22:24:25.233943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N Y Y Y Y
2020-02-20 22:24:25.233947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: Y Y N Y Y Y
2020-02-20 22:24:25.233950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: Y Y Y N Y Y
2020-02-20 22:24:25.233954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 4: Y Y Y Y N Y
2020-02-20 22:24:25.233958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 5: Y Y Y Y Y N
2020-02-20 22:24:25.234135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7623 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234370: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 7624 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1070, pci bus id: 0000:04:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 7624 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 7624 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1070, pci bus id: 0000:07:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 7624 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
Second: The code as-is does work; I have successfully run the training model, but of course it only used the first GPU.
The code I'm using is from a GAN project and uses a 'with' block for training:
with tf.Session() as session:
    # Time stamp
    localtime = time.asctime(time.localtime(time.time()))
    print("Starting TensorFlow session...")
    print("Local current time :", localtime)
    # Start TensorFlow session...
    session.run(tf.global_variables_initializer())
    .
    .
    .
I've been going in circles (and crazy) trying to figure out how to use the recommended tf.distribute.MirroredStrategy() to do parallel training across my GPUs. Everything I've come across so far leads in circles or stops short of applicable examples.
Is there a straightforward way to modify the session code to use the mirrored strategy? Is there just a more basic way to get the session calls to train across multiple GPUs?
It's doable, but the short answer to this is no. There's no straightforward way.
Everything I've been able to find requires a good understanding of TensorFlow and a fair amount of work to port a 'stock' TensorFlow session over to multi-GPU. It requires running multiple sessions with assigned GPUs and figuring out how to coordinate the training data between everything.
Switching to the newer strategy paradigm (or Keras multi-GPU) requires figuring out how to express the learning model in layers rather than sessions. Again, something that requires a pretty solid handle on TF.
If you're starting from scratch, I'd say look into Keras or the mirrored strategy from the beginning.
Managing sessions and coordinating data is a bit of a headache (that's why there are nice wrappers now), but it's been done a lot.
If you're a beginner like me, let me save you some frustration: it's a big task. Either go a different way or buckle up; there's no easy answer. (A rough sketch of the strategy-based style follows the link below.)
https://jhui.github.io/2017/03/07/TensorFlow-GPU/
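For orientation only, a minimal sketch of what the strategy/Keras style looks like (assuming TF 1.14+ or 2.x, where tf.distribute.MirroredStrategy is available outside contrib; the toy model and shapes here are purely illustrative, not the GAN from the question):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()        # one replica per visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                             # variables created here are mirrored across GPUs
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# model.fit(x, y, batch_size=...) then splits each global batch across the GPUs.
Porting a session-based GAN training loop to this style means restructuring it around a Keras model (or a custom training loop under the strategy scope), which is the "fair amount of work" mentioned above.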

How can I make TensorFlow train.py use all the available GPUs?

I am running tensorflow 1.7 on my local machine, which contains 2 GPUs with around 8 GB each.
Training the object detection model (train.py) works when I use the model 'faster_rcnn_resnet101_coco'. But when I tried to run 'faster_rcnn_nas_coco', it shows a 'Resource exhausted' error:
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N Y
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: Y N
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
Limit: 7744048333
InUse: 7699536896
MaxInUse: 7699551744
NumAllocs: 10260
MaxAllocSize: 4076716032
2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I am not sure if it's using both GPUs, because the memory in use is shown as '7699536896'. After going through train.py, I also tried
python train.py \
--logtostderr \
--worker_replicas=2 \
--pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
--train_dir=training
If 2 GPUs are available, does TensorFlow choose both of them by default, or does it need any arguments?
The number of GPUs used is the one specified by worker_replicas. For the NASNet case, try decreasing the batch size to make the network fit on the GPU.

Tensorflow crashes my system while retraining ssd_mobilenet

The system configuration is as follows:
Ubuntu 16.04,
CUDA 9.1,
cuDNN 7.0.5,
NVIDIA driver 390.30,
GTX 1050 Ti GPU,
tensorflow-gpu 1.7rc1 and 1.5,
the configuration file and train.py are stock from the TensorFlow distribution, and
the trained model being used is ssd_mobilenet_v1_coco_2017_11_17.
The following is collected from the putty terminal session:
(od) gennis#AI:~/models/research$ ./train_raccoon.sh
WARNING:tensorflow:From /home/dennis/.virtualenvs/od/local/lib/python2.7/site-packages/tensorflow/contrib/learn/python/learn/datasets/base.py:198: retry (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use the retry module or similar alternatives.
WARNING:tensorflow:From /home/dennis/models/research/object_detection/trainer.py:228: create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.create_global_step
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:depth of additional conv before box predictor: 0
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
WARNING:tensorflow:From /home/dennis/.virtualenvs/od/local/lib/python2.7/site-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-03-23 16:59:01.725435: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1355] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.455
pciBusID: 0000:01:00.0
totalMemory: 3.94GiB freeMemory: 3.89GiB
2018-03-23 16:59:01.725484: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1434] Adding visible gpu devices: 0
2018-03-23 16:59:02.090533: I tensorflow/core/common_runtime/gpu/gpu_device.cc:922] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-23 16:59:02.090592: I tensorflow/core/common_runtime/gpu/gpu_device.cc:928] 0
2018-03-23 16:59:02.090601: I tensorflow/core/common_runtime/gpu/gpu_device.cc:941] 0: N
2018-03-23 16:59:02.090801: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1052] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3631 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from /home/dennis/models/research/ssd_mobilenet_v1_coco_2017_11_17/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path temp/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
At this point the system crashes and I have to power down the system and then power it back up (reboot).
In addition to my training data and configuration, I also used the model and data from an article by Dat Tran called "How to train your object detector with Tensorflow's object detection API" and got the same results.
I have been able to run the mnist example and other tests that show that tensorflow-gpu is working.
I am not sure what to do next. Is there additional information I can gather to help further diagnose the problem?
Any advice would be greatly appreciated,
Thanks

multiple GPUs keras weird speedup

I implemented code similar to the multi-GPU example from Keras
(multiGPU tutorial). When running this on a server with 2 GPUs I get the following training times per epoch:
showing Keras only one GPU and setting the variable gpus = 1 (only use one GPU), one epoch = 32 s
showing Keras two GPUs, and gpus = 1, one epoch = 31 s
showing Keras two GPUs, and gpus = 2, one epoch = 37 s
The output looks a bit strange: while initializing, the code seems to create multiple TensorFlow devices per GPU, and I'm not sure if this is the correct behavior. Most other examples I saw had just one such line per GPU.
first test (one GPU shown, gpus = 1):
2017-12-04 14:54:04.071549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 15.93GiB
Free memory: 15.64GiB
2017-12-04 14:54:04.071597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-12-04 14:54:04.071605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-12-04 14:54:04.071619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:54:21.531654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
second test (2 GPU shown, gpus = 1):
2017-12-04 14:48:24.881733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:48:24.882924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:24.882931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:48:42.353807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:42.353851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
and weirdly for example 3 (gpus = 2):
2017-12-04 14:41:35.906828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:41:35.907996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:35.908002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:52.944335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:52.944377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:53.709812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:53.709838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
the code:
LSTM = keras.layers.CuDNNLSTM
model.add(LSTM(knots, input_shape=(timesteps, X_train.shape[-1]), return_sequences=True))
model.add(LSTM(knots))
model.add(Dense(3, activation='softmax'))
if gpus >= 2:
    model_basic = model
    with tf.device("/cpu:0"):
        model = model_basic
    parallel_model = multi_gpu_model(model, gpus=gpus)
    model = parallel_model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
hist = model.fit(myParameter)
Is this typical behavior? What is wrong with my code that multiple devices per GPU are created? Thanks in advance.
I tried the exact code from the multiGPU tutorial.
It looks like this is the expected output. But to see the expected speed differences I had to increase the number of samples (to 20000) and increase the height and width to 100 (due to RAM limits).
I'm not completely sure why I didn't see a speedup with two GPUs in my case. I expect it to be due to limits of the memory speed: my batch size is rather small and each sample is also small, so managing the data takes more time than the actual calculation.
Distributing the data becomes even more time consuming when using 2 GPUs, while the actual runtime on each GPU decreases.
This effect could be confirmed if I could check the utilization of the graphics cards. Sadly I don't know how to do this (a possible way is sketched below).
If anyone has other ideas on this, let me know. Thanks
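One possible way to watch GPU utilization, as a hedged sketch: either run watch -n 1 nvidia-smi in a separate terminal while training, or query NVML from Python. The snippet below assumes the optional pynvml package (pip install pynvml) is installed; it is not part of Keras or TensorFlow.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)   # percent GPU and memory-controller load
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)          # bytes used / total
    print("GPU %d: %d%% util, %d MiB used" % (i, util.gpu, mem.used // (1024 * 1024)))
pynvml.nvmlShutdown()
Running this (or nvidia-smi) periodically during training shows whether both GPUs are actually busy or mostly waiting for data.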

Is it OK to create the TensorFlow device multiple times?

I've run an image processing script using the TensorFlow API. It turns out that the processing time decreased quickly when I moved the for-loop outside the session-running procedure. Could anyone tell me why? Are there any side effects?
The original code:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(len(file_list)):
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        # image_crop, bboxs_crop, image_debug = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        # Image._show(Image.fromarray(np.asarray(image_crop)))
        # Image._show(Image.fromarray(np.asarray(image_debug)))
        save_image(image_crop, ntpath.basename(file_list[i]))
        # save_desc_file(file_list[i], labels_list[i], bboxs_crop)
        save_desc_file(file_list[i], labels, bboxs)
    coord.request_stop()
    coord.join(threads)
The modified code:
for i in range(len(file_list)):
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)
The time cost per image in the original code kept increasing from 200 ms to as much as 20000 ms. After the modification, the log messages indicate that more than one graph and more than one TensorFlow device were created during the run. Why is that?
python random_crop_images_hongyuan.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GT 730M major: 3 minor: 5 memoryClockRate (GHz) 0.758 pciBusID 0000:01:00.0 Total memory: 982.88MiB Free memory: 592.44MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3000 th in 317 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3001 th in 325 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3002 th in 312 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3003 th in 147 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3004 th in 447 ms
My guess is that this happens because creating the session is an expensive operation. Maybe it could also happen that the session is not properly cleaned up when the with-statement is left, so each new allocation on the device has fewer resources available. In short, I would not recommend doing it this way; rather, initialize just one session and try to reuse it.
EDIT:
In answer to your comment: the session is closed automatically as soon as the with-block is exited. I've read in this github issue that the memory on the GPU is only really released when the whole program exits. But I guess that when you allocate a new session after you closed the last one, TensorFlow will internally just reuse the previously allocated resources. So, in retrospect, my answer is probably not very insightful. Sorry if I caused confusion.
It's not possible to be 100% certain without seeing all of your code, but I would guess that the crop_image() function is calling various TensorFlow op functions to build a graph.
It is almost never a good idea to build a graph inside a for loop. This answer explains why: some operations (such as the first Session.run() call to a new operation) take time that is linear in the number of operations in the graph. If you add more operations in each iteration, iteration i will do work that is linear in i, and so the overall execution time will be quadratic.
The modified version of your code (with a with tf.Graph().as_default(): block inside the loop) will be faster because it creates a new, empty tf.Graph in each iteration, and therefore each iteration does a constant amount of work.
An even more efficient solution would be to build the graph and session once, using tf.placeholder() tensors to represent the filename and bbox arguments to crop_image, and feeding different values to these placeholders in each iteration.
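As a hedged sketch of that last suggestion (the placeholder shapes and the crop_image_graph() helper are hypothetical; they stand in for a version of crop_image() that builds its ops from tensors instead of Python values):
import tensorflow as tf

# Build the graph once, outside the loop.
filename_ph = tf.placeholder(tf.string, shape=[])        # one file path per run
bboxs_ph = tf.placeholder(tf.float32, shape=[None, 4])   # boxes for that file
image_crop_op, bboxs_crop_op = crop_image_graph(filename_ph, bboxs_ph)  # hypothetical helper

with tf.Session() as sess:
    for i in range(len(file_list)):
        image_crop, bboxs_crop = sess.run(
            [image_crop_op, bboxs_crop_op],
            feed_dict={filename_ph: file_list[i], bboxs_ph: bboxs_list[i]})
        # filter_bbox / save_image / save_desc_file as in the original loop
Because no new ops are added per iteration, each Session.run() call does a constant amount of graph work.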