I'm pretty new to Tensorflow and I'll be the first to admit I'm a bit confused and turned around and might very well be barking up the wrong tree.
First: this is NOT a question about getting my GPUs working and seen by TensorFlow (TF); I have verified from inside the container that the GPUs are detected by TF (using tensorflow/tensorflow:1.13.1-gpu-py3).
2020-02-20 22:24:25.233916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-02-20 22:24:25.233933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0 1 2 3 4 5
2020-02-20 22:24:25.233939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N Y Y Y Y Y
2020-02-20 22:24:25.233943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 1: Y N Y Y Y Y
2020-02-20 22:24:25.233947: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 2: Y Y N Y Y Y
2020-02-20 22:24:25.233950: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 3: Y Y Y N Y Y
2020-02-20 22:24:25.233954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 4: Y Y Y Y N Y
2020-02-20 22:24:25.233958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 5: Y Y Y Y Y N
2020-02-20 22:24:25.234135: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7623 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234370: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7624 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234516: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 7624 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1070, pci bus id: 0000:04:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234623: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 7624 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1070, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:4 with 7624 MB memory) -> physical GPU (device: 4, name: GeForce GTX 1070, pci bus id: 0000:07:00.0, compute capability: 6.1)
2020-02-20 22:24:25.234949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:5 with 7624 MB memory) -> physical GPU (device: 5, name: GeForce GTX 1070, pci bus id: 0000:08:00.0, compute capability: 6.1)
Second: the code as-is does work; I have successfully run the training, but of course it only used the first GPU.
The code I'm using is a GAN project and uses a 'with' block for training:
with tf.Session() as session:
    # Time stamp
    localtime = time.asctime(time.localtime(time.time()))
    print("Starting TensorFlow session...")
    print("Local current time :", localtime)
    # Start TensorFlow session...
    session.run(tf.global_variables_initializer())
.
.
.
I've been going in circles (and crazy) trying to figure out how to use the recommended tf.distribute.MirroredStrategy() to do parallel training across my GPUs. Everything I've come across so far leads in circles or stops short of applicable examples.
Is there a straightforward way to modify the session code to use the mirrored strategy? Is there just a more basic way to get the session calls to train across multiple GPUs?
It's doable, but the short answer to this is no. There's no straightforward way.
Everything I've been able to find requires a good understanding of TensorFlow and a fair amount of work to port a 'stock' TensorFlow session over to multi-GPU. It requires running multiple sessions with assigned GPUs and figuring out how to coordinate the training data between everything.
Switching to the newer strategy paradigm (or Keras multi-GPU) requires figuring out how to express the learning model in terms of layers rather than sessions. Again, something that requires a pretty solid handle on TF.
If you're starting from scratch, I'd say look into Keras or the mirrored strategy from the beginning (there's a rough sketch of that route below).
Managing sessions and coordinating data is a bit of a headache (that's why there are nice wrappers now), but it's been done a lot.
If you're a beginner like me, let me save you some frustration: it's a big task. Either go a different way or buckle up; there's no easy answer.
https://jhui.github.io/2017/03/07/TensorFlow-GPU/
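For what it's worth, here's a minimal sketch of what the strategy/Keras route looks like. I've written it against the current TF 2.x Keras API; on the 1.13 image in the question the strategy classes may still live under tf.contrib.distribute and the Keras integration is more limited, so the details differ. The toy data and layer sizes are made up purely for illustration:
import numpy as np
import tensorflow as tf

# Toy stand-in data so the example is self-contained.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU

with strategy.scope():
    # The model is expressed as Keras layers rather than raw session ops.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

# Each batch is split across the replicas automatically.
model.fit(x, y, epochs=5, batch_size=64)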
Related
(I'm sorry if this question is too novice, but as I don't quite understand and want to double-check whether I am using two GPUs in parallel in a correct manner, I ask the following question.)
Two GPUs (of the same model) are installed in the PC I am using. In a PyCharm project, I run a TensorFlow script, setting
os.environ["CUDA_VISIBLE_DEVICES"] = '0'
, which then runs with an initial run log of:
Using TensorFlow backend.
2018-09-15 03:36:36.727152: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 03:36:37.080157: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:17:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 03:36:37.080671: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 03:36:37.796088: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 03:36:37.796320: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2018-09-15 03:36:37.796469: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2018-09-15 03:36:37.796723: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
Then in another PyCharm project, I run a TensorFlow script, setting
os.environ["CUDA_VISIBLE_DEVICES"] = '1'
which then shows the following run log:
Using TensorFlow backend.
2018-09-15 03:37:00.119630: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 03:37:00.468546: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:65:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 03:37:00.468930: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0
2018-09-15 03:37:01.199726: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 03:37:01.199950: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0
2018-09-15 03:37:01.200096: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N
2018-09-15 03:37:01.200349: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
What worries me is that they both show device 0, but their pciBusIDs are different.
So my simple question is: am I using the two GPUs in parallel in a correct manner?
As I am using Windows 10, I monitored GPU usage with Device Manager, and it seems correct to me. But I just want to hear from experts.
And if it is okay for you to answer, what is a PCI bus ID, roughly? And why are both of them showing as device 0?
No, you have to add
with tf.device("your_device_name")
Just follow this tutorial, the 'Using Multiple GPUs' section: https://www.tensorflow.org/guide/using_gpu
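For a single program that uses both GPUs, the pattern from that 'Using Multiple GPUs' section looks roughly like this (TF 1.x graph/session style; the constants are just dummy values):
import tensorflow as tf

c = []
for d in ['/device:GPU:0', '/device:GPU:1']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))  # each matmul is pinned to a different GPU
with tf.device('/cpu:0'):
    total = tf.add_n(c)  # combine the per-GPU results on the CPU

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(total))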
I have two GPUs installed in my PC, as they are to be used in parallel (without any SLI or the like). Suppose I run a simple TensorFlow program like linear regression, as in this. Then which GPU is used? Are both of them used? Here is the run log.
2018-09-15 02:55:36.314345: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-15 02:55:36.675657: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:17:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 02:55:36.798520: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1405] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:65:00.0
totalMemory: 11.00GiB freeMemory: 9.08GiB
2018-09-15 02:55:36.799044: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1484] Adding visible gpu devices: 0, 1
2018-09-15 02:55:38.234984: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-09-15 02:55:38.235236: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:971] 0 1
2018-09-15 02:55:38.235392: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 0: N N
2018-09-15 02:55:38.235559: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:984] 1: N N
2018-09-15 02:55:38.235849: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8783 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:17:00.0, compute capability: 6.1)
2018-09-15 02:55:38.601267: I T:\src\github\tensorflow\tensorflow\core\common_runtime\gpu\gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8783 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
TensorFlow's default is to consume the memory of all visible GPUs, but unless you code for using multiple GPUs, only the first of the two will be used for computation.
You would typically set the environment variable export CUDA_VISIBLE_DEVICES=0 prior to running python to limit TensorFlow to only seeing gpu0, for example (0=gpu0, 1=gpu1, etc., -1=CPU only).
Using both GPUs for computation requires that you code for multiple GPUs (and make decisions about what that means in your model). There are many tutorials on the topic; here's one quick one I pulled up: http://blog.s-schoener.com/2017-12-15-parallel-tensorflow-intro/
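To give a flavor of what "coding for multiple GPUs" means in TF 1.x, here is a very condensed sketch of the usual tower pattern those tutorials walk through. Everything in it (the tiny model_fn, the placeholder shapes) is a made-up stand-in, not your model:
import tensorflow as tf

def model_fn(x, y):
    # Stand-in model: one dense layer and a squared-error loss.
    pred = tf.layers.dense(x, 1)
    return tf.reduce_mean(tf.square(pred - y))

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

num_gpus = 2
# Data parallelism: each GPU gets an equal slice of the batch.
x_splits = tf.split(x, num_gpus)
y_splits = tf.split(y, num_gpus)

tower_losses = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i), tf.variable_scope('model', reuse=(i > 0)):
        tower_losses.append(model_fn(x_splits[i], y_splits[i]))

loss = tf.reduce_mean(tower_losses)                  # average the per-tower losses
train_op = tf.train.AdamOptimizer().minimize(loss)   # gradients flow back through each tower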
I am running TensorFlow 1.7 on my local machine, which contains 2 GPUs, each with around 8 GB of memory.
Training the object detection model (train.py) works when I use 'faster_rcnn_resnet101_coco'. But when I tried to run 'faster_rcnn_nas_coco', it shows a 'Resource exhausted' error:
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py:736: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
Instructions for updating:
Please switch to tf.train.MonitoredTrainingSession
2018-05-02 16:14:53.963966: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0, 1
2018-05-02 16:14:53.964071: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-05-02 16:14:53.964083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0 1
2018-05-02 16:14:53.964091: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N Y
2018-05-02 16:14:53.964097: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 1: Y N
2018-05-02 16:14:53.964566: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7385 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:02:00.0, compute capability: 6.1)
2018-05-02 16:14:53.966360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 7552 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1070, pci bus id: 0000:03:00.0, compute capability: 6.1)
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
INFO:tensorflow:Restoring parameters from training/model.ckpt-0
Limit: 7744048333
InUse: 7699536896
MaxInUse: 7699551744
NumAllocs: 10260
MaxAllocSize: 4076716032
2018-05-02 16:16:52.223943: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ***********************************************************************************x****************
2018-05-02 16:16:52.223967: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at depthwise_conv_op.cc:358 : Resource exhausted: OOM when allocating tensor with shape[64,672,9,9] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
I am not sure if it's using both GPUs, because the memory in use shows as '7699536896'. After going through train.py, I also tried:
python train.py \
--logtostderr \
--worker_replicas=2 \
--pipeline_config_path=training/faster_rcnn_resnet101_coco.config \
--train_dir=training
If 2 GPUs are available, does TensorFlow by default use both of them, or does it need any arguments?
We use the number of GPUs specified by worker_replicas. For the NASNet case, try decreasing the batch size to make the network fit on the GPU.
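If it helps, the batch size for these models lives in the train_config block of the pipeline config. Here's a rough sketch of lowering it programmatically; the config path is just an example, and this assumes the TF Object Detection API and its compiled protos are installed:
from google.protobuf import text_format
from object_detection.protos import pipeline_pb2

# Illustrative path; point this at your own pipeline config.
config_path = 'training/faster_rcnn_nas_coco.config'

pipeline = pipeline_pb2.TrainEvalPipelineConfig()
with open(config_path) as f:
    text_format.Merge(f.read(), pipeline)

pipeline.train_config.batch_size = 1  # NASNet is huge; start small and work up

with open(config_path, 'w') as f:
    f.write(text_format.MessageToString(pipeline))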
I implemented code similar to the multi-GPU code from Keras (multiGPU tutorial). When running this on a server with 2 GPUs, I get the following training times per epoch:
showing Keras only one GPU and setting the variable gpus = 1 (only use one GPU): one epoch = 32 s
showing Keras two GPUs, and gpus = 1: one epoch = 31 s
showing Keras two GPUs, and gpus = 2: one epoch = 37 s
The output looks a bit strange: while initializing, the code seems to create multiple TensorFlow devices per GPU. I'm not sure if this is the correct behavior, but most other examples I saw had just one such line per GPU.
first test (one GPU shown, gpus = 1):
2017-12-04 14:54:04.071549: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:82:00.0
Total memory: 15.93GiB
Free memory: 15.64GiB
2017-12-04 14:54:04.071597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-12-04 14:54:04.071605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-12-04 14:54:04.071619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:54:21.531654: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
second test (2 GPUs shown, gpus = 1):
2017-12-04 14:48:24.881733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:48:24.882924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:24.882931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:48:42.353807: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:48:42.353851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
and, weirdly, for the third test (gpus = 2):
2017-12-04 14:41:35.906828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties:
...(same as earlier)
2017-12-04 14:41:35.907996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:35.908002: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:52.944335: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:52.944377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
2017-12-04 14:41:53.709812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:82:00.0)
2017-12-04 14:41:53.709838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:83:00.0)
the code:
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

model = Sequential()
LSTM = keras.layers.CuDNNLSTM  # use the cuDNN-accelerated LSTM implementation
model.add(LSTM(knots, input_shape=(timesteps, X_train.shape[-1]), return_sequences=True))
model.add(LSTM(knots))
model.add(Dense(3, activation='softmax'))

if gpus >= 2:
    # keep the single-device model around, then wrap it so each GPU gets a replica
    model_basic = model
    with tf.device("/cpu:0"):
        model = model_basic
    parallel_model = multi_gpu_model(model, gpus=gpus)
    model = parallel_model

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
hist = model.fit(myParameter)
Is this typical behavior? What is wrong with my code such that multiple devices per GPU are created? Thanks in advance.
I tried the exact code from the multi-GPU tutorial.
It looks like this is somehow the expected output, but to see the expected speed differences I had to increase the number of samples (20000) and set the height and width to 100 (due to RAM limits).
I'm not completely sure why I didn't see a speedup with two GPUs in my case. I expect it is due to memory-speed limits: my batch size is rather small and each sample is also small, so managing the data takes more time than the actual calculation.
Distributing the data gets even more time-consuming when using 2 GPUs, while the actual compute time on each GPU decreases.
This effect could be proven if I could check the utilization of the graphics cards; sadly, I don't know how to do this.
If anyone has other ideas on this, let me know. Thanks
I am trying to train my model on a GPU instead of a CPU on an AWS p2.xlarge instance from my Jupyter Notebook. I am using the tensorflow-gpu backend (only tensorflow-gpu was installed and mentioned in requirements.txt, not tensorflow).
I am not seeing any speed improvements when training models on these instances compared to using a CPU; in fact, I am getting training speeds per epoch that are almost the same as on my 4-core laptop CPU (p2.xlarge also has 4 vCPUs with a Tesla K80 GPU). I am not sure if I need to make some changes to my code to accommodate the faster/parallel processing that the GPU can offer. I am pasting my model code below:
from keras.models import Sequential
from keras.layers import recurrent, core

model = Sequential()
model.add(recurrent.LSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                         return_sequences=True))
model.add(recurrent.LSTM(64, return_sequences=False))
model.add(core.Dropout(0.1))
model.add(core.Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.fit(X_np, y_np, epochs=100, validation_split=0.25)
Also, interestingly, the GPU seems to be using between 50% and 60% of its processing power and almost all of its memory every time I check the GPU status using nvidia-smi (but both fall to 0% and 1 MiB respectively when not training):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81 Driver Version: 384.81 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 |
| N/A 47C P0 73W / 149W | 10919MiB / 11439MiB | 52% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1665 C ...ubuntu/aDash/MLenv/bin/python 10906MiB |
+-----------------------------------------------------------------------------+
Also if you'd like to see my logs about using the GPU from Jupyter Notebook:
[I 04:21:59.390 NotebookApp] Kernel started: c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
[I 04:22:02.241 NotebookApp] Adapting to protocol v5.1 for kernel c17bc4d1-fa15-4b0e-b5f0-87f90e56bf65
2017-11-30 04:22:32.403981: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-11-30 04:22:33.653681: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-11-30 04:22:33.654041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2017-11-30 04:22:33.654070: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2017-11-30 04:22:34.014329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:22:34.015339: I tensorflow/core/common_runtime/direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7
2017-11-30 04:23:22.426895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
Please suggest what could be the problem. Thanks a ton for looking at this anyways!
That happens because you're using LSTM layers.
TensorFlow's implementation of LSTM layers is not that great. The reason is probably that recurrent calculations are not parallel calculations, and GPUs are great for parallel processing.
I confirmed that by my own experience:
Got terrible speed using LSTMs in my model
Decided to test the model removing all LSTMs (got a pure convolutional model)
The resulting speed was simply astonishing!!!
This article about using GPUs and tensorflow also confirms that:
http://minimaxir.com/2017/07/cpu-or-gpu/
A possible solution?
You may try using the new CuDNNLSTM, which seems to be built specifically for GPUs.
I never tested it, but you'll most probably get a better performance with this.
Another thing that I haven't tested, and I'm not sure it's designed for this purpose, but I suspect it is: you can pass unroll=True to your LSTM layers. With that, I suspect the recurrent calculations will be transformed into parallel ones.
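As a rough sketch of the CuDNNLSTM suggestion applied to the model above (assuming Keras >= 2.0.9 with the TensorFlow backend; note that CuDNNLSTM only runs on a GPU and does not take an activation or recurrent_dropout argument):
from keras.models import Sequential
from keras.layers import CuDNNLSTM, Dropout, Dense

model = Sequential()
# Same architecture as before, with the cuDNN-backed LSTM implementation.
model.add(CuDNNLSTM(64, input_shape=(X_np.shape[1], X_np.shape[2]),
                    return_sequences=True))
model.add(CuDNNLSTM(64, return_sequences=False))
model.add(Dropout(0.1))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])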
Try using a bigger value for batch_size in model.fit, since the default is 32. Increase it until you get 100% GPU utilization.
Following the suggestion from @dgumo, you can also put your data into /run/shm. This is an in-memory file system, which allows access to the data in the fastest possible way. Alternatively, you can ensure that your data resides at least on an SSD, for example in /tmp.
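For example (the file name here is purely illustrative):
import shutil
import numpy as np

# Stage the training data in the in-memory filesystem once...
shutil.copy("data/train.npz", "/run/shm/train.npz")

# ...then load it from there, so reads come from RAM rather than disk.
data = np.load("/run/shm/train.npz")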
The bottleneck in your case is transferring data to and from the GPU. The best way to speed up your computation (and maximize your GPU usage) is to load as much of your data as your memory can hold at once. Since you have plenty of memory, you can put all your data in at once by doing:
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=X_np.shape[0])
(You should also probably increase the number of epochs when doing this).
Note however that there are advantages to minibatching (e.g. better handling of local minima), so you should probably consider choosing a batch_size somewhere in between.
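For instance (the exact value is just a placeholder for "somewhere in between"):
model.fit(X_np, y_np, epochs=100, validation_split=0.25, batch_size=1024)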