TensorFlow cannot run integer matrix multiplication on GPU

Take a look at this example, where I attempt to multiply two tf.int32 matrices using my GPU.
import tensorflow as tf

matrix1 = tf.constant([[3, 3]])
matrix2 = tf.constant([[2], [2]])

with tf.device("/gpu:0"):
    product = tf.matmul(matrix1, matrix2)

with tf.Session() as sess:
    result = sess.run(product)
    print(result)
It is similar to the example found on https://www.tensorflow.org/versions/r0.10/get_started/basic_usage.html
I get the output:
...
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:03:00.0
Total memory: 7.92GiB
Free memory: 213.62MiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:839] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:03:00.0)
E tensorflow/core/client/tensor_c_api.cc:485] Cannot assign a device to node 'MatMul': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.
[[Node: MatMul = MatMul[T=DT_INT32, transpose_a=false, transpose_b=false, _device="/device:GPU:0"](Const, Const_1)]]
Why can't I perform a matrix multiplication on the GPU? I can work around this by setting allow_soft_placement = True, but I would like to run it on the GPU.

Integer multiplication is currently not implemented for the GPU in TensorFlow, and your matrices matrix1 and matrix2 have type tf.int32. (It turns out that it is easy to implement but, for various reasons discussed in this answer, TensorFlow doesn't include op registrations for tf.int32 on GPU devices.)
Assuming you are actually interested in multiplying (much larger) floating-point matrices, you can change your program to:
import tensorflow as tf

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.], [2.]])

with tf.device("/gpu:0"):
    product = tf.matmul(matrix1, matrix2)

with tf.Session() as sess:
    result = sess.run(product)
    print(result)
...and the multiplication will execute on your GPU.
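If the inputs really must stay tf.int32, one possible workaround (a minimal sketch of my own, not part of the original answer) is to cast to float32 for the GPU matmul and cast the result back; alternatively, allow_soft_placement=True lets the int32 MatMul silently fall back to the CPU kernel:

import tensorflow as tf

matrix1 = tf.constant([[3, 3]])
matrix2 = tf.constant([[2], [2]])

with tf.device("/gpu:0"):
    # Cast to float32 so a GPU MatMul kernel exists, then cast the result back to int32.
    product = tf.cast(
        tf.matmul(tf.cast(matrix1, tf.float32), tf.cast(matrix2, tf.float32)),
        tf.int32)

# allow_soft_placement=True would instead let ops without a GPU kernel run on the CPU.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(product))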

Related

SageMaker fails when using Multi-GPU with keras.utils.multi_gpu_model

When running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras with a TensorFlow backend in a multi-GPU configuration:
from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)
This simple parallel model loading fails. There is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.
According to the SageMaker documentation and tutorials, the multi_gpu_model utility works when the Keras backend is MXNet, but I found no mention of the same multi-GPU configuration with a TensorFlow backend.
[UPDATE]
I have updated the code following the suggested answer below, and I'm adding some logging from just before the TrainingJob hangs.
This logging repeats twice:
2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)
Before that, there is some logging info about each GPU, repeated 4 times:
2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB
According to the logging, all 4 GPUs are visible and loaded by the TensorFlow Keras backend. After that, no application logging follows; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.
Looking at the CloudWatch metrics, GPU memory utilization and CPU utilization look fine, while GPU utilization is 0%.
[UPDATE]
Due to a known Keras bug with saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
This works fine locally on 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04. It fails on the SageMaker Training Job.
I have posted this issue on the AWS SageMaker forum in:
TrainingJob custom algorithm with Keras backend and multi GPU
SageMaker Fails when using Multi-GPU with keras.utils.multi_gpu_model
[UPDATE]
I have slightly modified the tf.Session code, adding some initializers:
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
and now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is used. Multi-GPU still does not work.
This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:
def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
    sess = tf.Session(config=config)
    set_session(sess)  # set this TensorFlow session as the default session for Keras.

    # getting the number of GPUs
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)

    return num_gpu
Then I call:
# Setup multi GPU usage
num_gpu = setup_multi_gpus()
and create a model.
...
After that, you're able to make it a multi-GPU model.
multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...
The only thing here that differs from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.
Good luck!
Edit: I noticed that sequence-to-sequence models do not work with multi-GPU. Is that the type of model you are trying to train?
I apologize for the slow response.
It seems there are a lot of threads that are running in parallel, and I want to link them together, so that other individuals who have the same issue can see the progress and discussion going on.
https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540
https://github.com/aws/sagemaker-python-sdk/issues/512
There are a few questions regarding this.
What versions of TensorFlow and Keras are you using?
I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA? https://www.tensorflow.org/install/gpu
Were you able to train using a single GPU with Keras?
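As a quick sanity check that the container actually sees the GPUs (a minimal sketch, using the same device_lib helper as the answer above), something like this could be run at the start of the training script:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see inside the container;
# GPUs show up with device_type == 'GPU'.
print([d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])

# Another quick check in TF 1.x:
print(tf.test.is_gpu_available())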

Does TensorFlow calculate these partial derivatives/gradients correctly?

I'm just starting to learn TensorFlow and came across an example that doesn't make sense to me:
>>> import tensorflow as tf
>>> a=tf.Variable(1.)
>>> b=2*a
>>> c=a+b
>>> g=tf.gradients(c, [a,b])
>>> sess=tf.Session()
2018-09-20 13:50:59.616341: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-09-20 13:50:59.616400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
>>> print sess.run(g)
[3.0, 1.0]
Since c=3a, I expected the first partial (partial c with respect to a) to be 3.0. But, it is also true that c=1.5b, so I expected the second partial derivative to be 1.5, not 1.0.
On the other hand, if I do the following:
>>> b = tf.Variable(2.)
>>> a = 0.5*b
>>> c = a+b
>>> g = tf.gradients(c,[a,b])
I get this result:
>>> print sess.run(g)
[1.0, 1.5]
I have similar problems with this answer.
Additionally, I would think I'm looking for the same information about the same function at the same point with the same constraint in these two cases. I would expect the same answers.
Have I forgotten something truly embarrassing about partial derivatives or algebra? Or am I fundamentally misunderstanding something about what I can expect from TensorFlow gradients?
Does it have something to do with the graph construction ending up in a situation where b depends on a, but a is independent of b? Or is the real problem that gradients should only be taken with respect to variables that are strictly independent of each other?
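For what it's worth, a minimal sketch (my own construction, not from the question) suggests it is indeed about graph structure: tf.gradients differentiates along the edges of the graph that was built, so in the first snippet the gradient with respect to a accumulates both the direct path a -> c and the indirect path a -> b -> c, while b only contributes its direct edge b -> c. Building a and b as independent variables gives [1.0, 1.0] instead:

import tensorflow as tf

# Independent variables: c = a + b with no edge between a and b.
a = tf.Variable(1.)
b = tf.Variable(2.)
c = a + b
g_independent = tf.gradients(c, [a, b])

# The question's construction: b2 = 2*a2, so a2 reaches c2 through two paths.
a2 = tf.Variable(1.)
b2 = 2 * a2
c2 = a2 + b2
g_dependent = tf.gradients(c2, [a2, b2])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(g_independent))  # [1.0, 1.0]
    print(sess.run(g_dependent))    # [3.0, 1.0], as in the question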

Google Cloud ML Engine GPUs error

I've created several jobs for training a CNN using Google Cloud ML Engine; each time the job finished successfully but with a GPU error. The printed device placement included some GPU activity, but there was no GPU usage shown in the job details/utilization.
Here is the command I use to create a job:
gcloud beta ml-engine jobs submit training fei_test34 --job-dir gs://tfoutput/joboutput --package-path trainer --module-name=trainer.main --region europe-west1 --staging-bucket gs://tfoutput --scale-tier BASIC_GPU -- --data=gs://crispdata/cars_128 --max_epochs=1 --train_log_dir=gs://tfoutput/joboutput --model=trainer.crisp_model_2x64_2xBN --validation=True -x
Here is the device placement log:
(screenshot: log device placement)
GPU error:
(screenshot: GPU error detail)
More info:
When I ran my code on Google Cloud ML Engine with image size 112x112, the average training speed using one Tesla K80 was 8.2 examples/sec, while the average speed without GPUs was 5.7 examples/sec. With the same code I got 130.4 examples/sec using one GRID K520 on Amazon AWS. I expected the Tesla K80 to be faster. I also got the GPU error I posted yesterday. Additionally, in Compute Engine Quotas I can see CPU usage > 0%, but GPU usage remains at 0%. I was wondering whether the GPU is really working.
I am not familiar with cloud computing, so I'm not sure I've provided enough information. Feel free to ask for more details.
I just tried setting the scale tier to complex_model_m_gpu; the training speed is about the same as with one GPU (because my code is written for one GPU), but there is more information in the log. Here is a copy of the log:
I successfully opened CUDA library libcudnn.so.5 locally
I successfully opened CUDA library libcufft.so.8.0 locally
I successfully opened CUDA library libcuda.so.1 locally
I successfully opened CUDA library libcurand.so.8.0 locally
I Summary name cross_entropy (raw) is illegal; using cross_entropy__raw_ instead.
I Summary name total_loss (raw) is illegal; using total_loss__raw_ instead.
W The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I Found device 0 with properties:
E name: Tesla K80
E major: 3 minor: 7 memoryClockRate (GHz) 0.8235
E pciBusID 0000:00:04.0
E Total memory: 11.20GiB
E Free memory: 11.13GiB
W creating context when one is currently active; existing: 0x39ec240
I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I Found device 1 with properties:
E name: Tesla K80
E major: 3 minor: 7 memoryClockRate (GHz) 0.8235
E pciBusID 0000:00:05.0
E Total memory: 11.20GiB
E Free memory: 11.13GiB
W creating context when one is currently active; existing: 0x39f00b0
I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I Found device 2 with properties:
E name: Tesla K80
E major: 3 minor: 7 memoryClockRate (GHz) 0.8235
E pciBusID 0000:00:06.0
E Total memory: 11.20GiB
E Free memory: 11.13GiB
W creating context when one is currently active; existing: 0x3a148b0
I successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I Found device 3 with properties:
E name: Tesla K80
E major: 3 minor: 7 memoryClockRate (GHz) 0.8235
E pciBusID 0000:00:07.0
E Total memory: 11.20GiB
E Free memory: 11.13GiB
I Peer access not supported between device ordinals 0 and 1
I Peer access not supported between device ordinals 0 and 2
I Peer access not supported between device ordinals 0 and 3
I Peer access not supported between device ordinals 1 and 0
I Peer access not supported between device ordinals 1 and 2
I Peer access not supported between device ordinals 1 and 3
I Peer access not supported between device ordinals 2 and 0
I Peer access not supported between device ordinals 2 and 1
I Peer access not supported between device ordinals 2 and 3
I Peer access not supported between device ordinals 3 and 0
I Peer access not supported between device ordinals 3 and 1
I Peer access not supported between device ordinals 3 and 2
I DMA: 0 1 2 3
I 0: Y N N N
I 1: N Y N N
I 2: N N Y N
I 3: N N N Y
I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)
I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)
I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)
I Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0)
I Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: 0000:00:05.0)
I Creating TensorFlow device (/gpu:2) -> (device: 2, name: Tesla K80, pci bus id: 0000:00:06.0)
I Creating TensorFlow device (/gpu:3) -> (device: 3, name: Tesla K80, pci bus id: 0000:00:07.0)
I 361
I bucket = crispdata, folder = cars_128/train
I path = gs://crispdata/cars_128/train
I Num examples = 240
I bucket = crispdata, folder = cars_128/val
I path = gs://crispdata/cars_128/val
I Num examples = 60
I {'flop': False, 'learning_rate_decay_factor': 0.005, 'train_log_dir': 'gs://tfoutput/joboutput/20170411_144221', 'valid_score_path': '/home/ubuntu/tensorflow/cifar10/validation_score.csv', 'saturate_epoch': 200, 'test_score_path': '', 'max_tries': 75, 'max_epochs': 10, 'id': '20170411_144221', 'test_data_size': 0, 'memory_usage': 0.3, 'load_size': 128, 'test_batch_size': 10, 'max_out_norm': 1.0, 'email_notify': False, 'skip_training': False, 'log_device_placement': False, 'learning_rate_decay_schedule': '', 'cpu_only': False, 'standardize': False, 'num_epochs_per_decay': 1, 'zoom_out': 0.0, 'val_data_size': 100, 'learning_rate': 0.1, 'grayscale': 0.0, 'train_data_size': 250, 'minimal_learning_rate': 1e-05, 'save_valid_scores': False, 'train_batch_size': 50, 'rotation': 0.0, 'val_epoch_size': 2, 'data': 'gs://crispdata/cars_128', 'val_batch_size': 50, 'num_classes': 2, 'learning_rate_decay': 'linear', 'random_seed': 5, 'num_threads': 1, 'num_gpus': 1, 'test_dir': '', 'shuffle_traindata': False, 'pca_jitter': 0.0, 'moving_average_decay': 1.0, 'sample_size': 128, 'job-dir': 'gs://tfoutput/joboutput', 'learning_algorithm': 'sgd', 'train_epoch_size': 5, 'model': 'trainer.crisp_model_2x64_2xBN', 'validation': False, 'tower_name': 'tower'}
I Filling queue with 100 CIFAR images before starting to train. This will take a few minutes.
I name: "train"
I op: "NoOp"
I input: "^GradientDescent"
I input: "^ExponentialMovingAverage"
I 128 128
I 2017-04-11 14:42:44.766116: epoch 0, loss = 0.71, lr = 0.100000 (5.3 examples/sec; 9.429 sec/batch)
I 2017-04-11 14:43:19.077377: epoch 1, loss = 0.53, lr = 0.099500 (8.1 examples/sec; 6.162 sec/batch)
I 2017-04-11 14:43:51.994015: epoch 2, loss = 0.40, lr = 0.099000 (7.7 examples/sec; 6.479 sec/batch)
I 2017-04-11 14:44:22.731741: epoch 3, loss = 0.39, lr = 0.098500 (8.2 examples/sec; 6.063 sec/batch)
I 2017-04-11 14:44:52.476539: epoch 4, loss = 0.24, lr = 0.098000 (8.4 examples/sec; 5.935 sec/batch)
I 2017-04-11 14:45:23.626918: epoch 5, loss = 0.29, lr = 0.097500 (8.1 examples/sec; 6.190 sec/batch)
I 2017-04-11 14:45:54.489606: epoch 6, loss = 0.56, lr = 0.097000 (8.6 examples/sec; 5.802 sec/batch)
I 2017-04-11 14:46:27.022781: epoch 7, loss = 0.12, lr = 0.096500 (6.4 examples/sec; 7.838 sec/batch)
I 2017-04-11 14:46:57.335240: epoch 8, loss = 0.25, lr = 0.096000 (8.7 examples/sec; 5.730 sec/batch)
I 2017-04-11 14:47:30.425189: epoch 9, loss = 0.11, lr = 0.095500 (7.8 examples/sec; 6.398 sec/batch)
Does this mean that GPUs are in use? If yes, any idea why there is such a huge speed difference from the GRID K520 when executing the same code?
The log messages indicate that GPUs are available. To check whether the GPUs are actually being used, you can turn on logging of device placement to see which ops are assigned to the GPUs.
The Cloud Compute console won't show any utilization metrics related to Cloud ML Engine. If you look at the Cloud Console UI for your jobs you will see memory and CPU graphs but not GPU graphs.
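Turning on device placement logging is just a session config flag (a minimal sketch; the job's own config dump above already carries a log_device_placement option); the placements are printed when the session runs and end up in the job logs:

import tensorflow as tf

# Log which device each op is placed on (GPU vs. CPU).
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    # The placement of MatMul (and the random ops) is printed to the logs.
    sess.run(tf.matmul(a, b))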

Is it OK to create a TensorFlow device multiple times?

I've run an image-processing script using the TensorFlow API. It turns out that the processing time decreased a lot when I moved the for-loop outside of the session-running procedure. Could anyone tell me why? Are there any side effects?
The original code:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(len(file_list)):
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        # image_crop, bboxs_crop, image_debug = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        # Image._show(Image.fromarray(np.asarray(image_crop)))
        # Image._show(Image.fromarray(np.asarray(image_debug)))
        save_image(image_crop, ntpath.basename(file_list[i]))
        # save_desc_file(file_list[i], labels_list[i], bboxs_crop)
        save_desc_file(file_list[i], labels, bboxs)

    coord.request_stop()
    coord.join(threads)
The modified code:
for i in range(len(file_list)):
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)
The per-image time cost in the original code kept increasing, from 200 ms up to as much as 20000 ms. After the modification, the log messages indicate that more than one graph and more than one TensorFlow device were created during the run. Why is that?
python random_crop_images_hongyuan.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GT 730M major: 3 minor: 5 memoryClockRate (GHz) 0.758 pciBusID 0000:01:00.0 Total memory: 982.88MiB Free memory: 592.44MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3000 th in 317 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3001 th in 325 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3002 th in 312 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3003 th in 147 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3004 th in 447 ms
My guess is that this happens because creating the session is an expensive operation. Maybe the session is also not properly cleaned up when the with-statement is left, so each new allocation on the device has fewer resources available. In short, I would not recommend doing it this way; rather, initialize just one session and try to reuse it.
EDIT:
In answer to your comment: the session is closed automatically as soon as the with-block is exited. I've read in this GitHub issue that the memory on the GPU is only really released when the whole program exits. But I guess that when you allocate a new session after closing the last one, TensorFlow internally just reuses the previously allocated resources. So, in retrospect, my answer is probably not very insightful. Sorry if I caused any confusion.
It's not possible to be 100% certain without seeing all of your code, but I would guess that the crop_image() function is calling various TensorFlow op functions to build a graph.
It is almost never a good idea to build a graph inside a for loop. This answer explains why: some operations (such as the first Session.run() call to a new operation) take time that is linear in the number of operations in the graph. If you add more operations in each iteration, iteration i will do work that is linear in i, and so the overall execution time will be quadratic.
The modified version of your code (with a with tf.Graph().as_default(): block inside the loop) will be faster because it creates a new, empty tf.Graph in each iteration, and therefore each iteration does a constant amount of work.
An even more efficient solution would be to build the graph and session once, using tf.placeholder() tensors to represent the filename and bbox arguments to crop_image, and feeding different values to these placeholders in each iteration.
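A minimal sketch of that placeholder-based approach (assuming crop_image() can be turned into a tensor-in/tensor-out function, here called crop_image_ops(), since the real implementation isn't shown in the question) might look like:

import tensorflow as tf

# Build the graph once, with placeholders for the per-image inputs.
filename_ph = tf.placeholder(tf.string, shape=[])
bboxs_ph = tf.placeholder(tf.float32, shape=[None, 4])  # assumed bbox format

# Hypothetical tensor-in/tensor-out version of crop_image().
image_crop_op, bboxs_crop_op = crop_image_ops(filename_ph, bboxs_ph)

with tf.Session() as sess:
    for filename, bboxs in zip(file_list, bboxs_list):
        # Feed new values into the same graph each iteration; no new ops are
        # created, so the per-image time stays roughly constant.
        image_crop, bboxs_crop = sess.run(
            [image_crop_op, bboxs_crop_op],
            feed_dict={filename_ph: filename, bboxs_ph: bboxs})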

Tensorflow issue with GPU on matmul. GPU isn't recognized

I installed TensorFlow with GPU support, CUDA 7.0, and cuDNN 6.5. When I import tensorflow it works well.
I am trying to run a simple matrix multiplication in TensorFlow and it doesn't want to use my GPU, though it seems to recognize it. I have this issue on my computer with an NVIDIA GeForce 970M and on a cluster with two Titan Z cards.
My first code is:
import tensorflow as tf
import numpy as np
size=100
# I create 2 matrices
mat1 = np.random.random_sample([size, size])*100
mat2 = np.random.random_sample([size, size])*100
a = tf.constant(mat1)
b = tf.constant(mat2)
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(c)
This code works and the result is:
Const_1: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] Const_1: /job:localhost/replica:0/task:0/gpu:0
Const: /job:localhost/replica:0/task:0/gpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] Const: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul: /job:localhost/replica:0/task:0/cpu:0
So as far as I can tell, TensorFlow uses my GPU to create the constants but not for the matmul (which is weird). Then I force the GPU like this:
with tf.device("/gpu:0"):
a = tf.constant(mat1)
b = tf.constant(mat2)
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(c)
And TensorFlow returns:
InvalidArgumentError: Cannot assign a device to node 'MatMul': Could not satisfy explicit device specification '/gpu:0'
If someone has the same problem or an idea, I will be glad to read your answer!
I do not have enough reputation to comment. I have come across a similar issue; my question is here:
TensorFlow: critical graph operations assigned to cpu rather than gpu
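For reference, the same soft-placement workaround mentioned in the first question above (a minimal sketch; it lets any op without a GPU kernel for the given dtype fall back to the CPU, with log_device_placement kept on to confirm where everything actually runs) would be:

import tensorflow as tf
import numpy as np

size = 100
mat1 = np.random.random_sample([size, size]) * 100
mat2 = np.random.random_sample([size, size]) * 100

with tf.device("/gpu:0"):
    a = tf.constant(mat1)
    b = tf.constant(mat2)
    c = tf.matmul(a, b)

# allow_soft_placement lets TensorFlow place an op on the CPU when no GPU
# kernel is registered for its dtype, instead of raising InvalidArgumentError.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(config=config)
print(sess.run(c))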