I'm going to train a seq2seq model using the tf-seq2seq package on a 1080 Ti (11GB) GPU. I always get the following error, regardless of network size (even nmt_small):
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Graphics Device
major: 6 minor: 1 memoryClockRate (GHz) 1.582
pciBusID 0000:03:00.0
Total memory: 10.91GiB
Free memory: 10.75GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Graphics Device, pci bus id: 0000:03:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 10.91G (11715084288 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 12337 get requests, put_count=10124 evicted_count=1000 eviction_rate=0.0987752 and unsatisfied allocation rate=0.268542
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:Saving checkpoints for 1 into ../model/model.ckpt.
INFO:tensorflow:step = 1, loss = 5.07399
It seems that TensorFlow tries to allocate the GPU's total memory (10.91GiB), but clearly only 10.75GiB is available.
Here are some tips:
1. Use memory growth. From the TensorFlow documentation: "in some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as is needed by the process. TensorFlow provides two Config options on the Session to control this."
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config, ...)
2. Are you training in batches, or feeding the whole dataset at once? If the latter, switch to batches and/or decrease your batch size (see the sketch below).
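A generic sketch of batched feeding, not specific to tf-seq2seq: train_data, inputs_ph, and train_op are hypothetical stand-ins for whatever your script defines, and config is the ConfigProto from tip 1.
import tensorflow as tf

batch_size = 32  # lower this further if you still hit CUDA_ERROR_OUT_OF_MEMORY
with tf.Session(config=config) as sess:
    for start in range(0, len(train_data), batch_size):
        batch = train_data[start:start + batch_size]
        # each run only needs enough GPU memory for one batch of activations
        sess.run(train_op, feed_dict={inputs_ph: batch})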
In addition to the suggestions above concerning memory growth, you can also try:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90
with tf.Session(config=sess_config) as sess:
...
With this you can limit the amount of GPU memory allocated by the program, in this case to 90 percent of the available GPU memory. Maybe this is sufficient to stop the network from trying to allocate more memory than is available.
If it is not, you will have to decrease the batch size or the network size.
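If you want both behaviours at once, the two options can be combined; a minimal sketch:
import tensorflow as tf

sess_config = tf.ConfigProto()
sess_config.gpu_options.allow_growth = True                     # grow allocations on demand
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.90  # hard cap at 90% of the card
with tf.Session(config=sess_config) as sess:
    pass  # build and run your model here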
Related
I am trying to run StyleGAN2 on a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training on a set of 512x512 pixel images. After a couple of iterations, I get the error message reported below, and it looks like the script stops running (watching nvidia-smi, both the temperature and the fan activity of the GPUs decrease). I have already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?
I was able to run StyleGAN with the same dataset. In the paper they say that StyleGAN2 should be less heavy, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The config-f model for StyleGAN2 is actually bigger than StyleGAN1. Try a configuration that consumes less VRAM, such as config-e. You can change the model configuration by passing a flag in your Python command, as defined here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I'm able to train StyleGAN2 with config-e on two RTX 2080 Ti cards.
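For example (flag names as in the stylegan2 repository's README; substitute your own dataset name and data directory):
python run_training.py --num-gpus=8 --data-dir=~/datasets --dataset=<your_dataset> --config=config-e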
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
Your NVIDIA GeForce RTX 2080 card has 11GB, but I guess you're saying you have 8 of them? I don't think TensorFlow is set up for that kind of parallelism out of the box.
I use the Keras pre-trained InceptionResNetV2 to extract image features, but it always raises CUDA_ERROR_OUT_OF_MEMORY when I run prediction, even on a single file.
The environment is CUDA 10.0, cuDNN 7.4, TensorFlow 1.13, and an RTX 2070 with 8GB of GPU memory.
Here is the code:
import numpy as np
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session  # standalone Keras 2.x import path
from keras.applications.inception_resnet_v2 import InceptionResNetV2

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
set_session(tf.Session(config=config))
...
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
...
features = IR2.predict_on_batch(np.array([test_image]))
# test_image contains only one image
Error messages are:
E tensorflow/stream_executor/cuda/cuda_driver.cc:806] failed to allocate 3.53G (3794432768 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.39GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.39GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
This may be a long shot. Ubuntu 16.04, an Nvidia 1070 with 8GB on board; the machine has 64GB of RAM, the dataset is 1 million records, and I'm on current CUDA and cuDNN libraries with TensorFlow 1.0 and Python 3.6. Not sure how to troubleshoot this.
I have been working on getting some models up with TensorFlow and have run into this phenomenon a number of times. I don't know of anything other than TensorFlow that could be using the GPU memory:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.56GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 1.50G (1614867456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cu
Following that, I get the output below, which indicates that some sort of memory allocation is going on, yet it is still failing:
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 5 Chunks of size 899200000 totalling 4.19GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1649756928 totalling 1.54GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 6.40GiB
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 8499298304
InUse: 6875780608
MaxInUse: 6878976000
NumAllocs: 338
MaxAllocSize: 1649756928
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ******************************************************************************************xxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 6.10MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:993] Internal: Dst tensor is not initialized.
[[Node: linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice/_1055 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_1643_linear/linear/marital_status/marital_status_weights/embedding_lookup_sparse/strided_slice", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Update: I reduced the record count from millions to 40,000 and got a base model to run to completion. I still get an error message, but not continuous ones. I get a bunch of text in the model output suggesting restructuring the model, and I suspect that the data structure is a big part of the problem. I could still use some better hints on how to debug the entire process. Below is the remaining console output:
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.645
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.52GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:1002] failed to allocate 7.92G (8499298304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
[I 09:13:09.297 NotebookApp] Saving file at /Documents/InfluenceH/Working_copies/Cond_fcast_wkg/TensorFlow+DNNLinearCombinedClassifier+for+Influence.ipynb
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
I think the problem is that TensorFlow tries to allocate 7.92GB of GPU memory, while only 7.56GB are actually free. I cannot tell you for what reason the rest of the GPU memory is occupied, but you might avoid this problem by limiting the fraction of the GPU memory your program is allowed to allocate:
sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = 0.9
with tf.Session(config=sess_config, ...) as ...:
With this, the program will only allocate 90 percent of the GPU memory, i.e. 7.13GB.
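If you are unsure how much memory is actually free, you can derive the fraction at runtime instead of hard-coding it; a sketch, assuming nvidia-smi is on the PATH and you want the first GPU:
import subprocess
import tensorflow as tf

out = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=memory.total,memory.free',
     '--format=csv,noheader,nounits']).decode()
total_mb, free_mb = [int(v) for v in out.splitlines()[0].split(',')]
fraction = 0.95 * free_mb / total_mb  # leave a little headroom below "free"

sess_config = tf.ConfigProto()
sess_config.gpu_options.per_process_gpu_memory_fraction = fraction
with tf.Session(config=sess_config) as sess:
    pass  # build and run the model here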
I am using multiple GPUs (num_gpus = 4) to train one model with multiple towers. The model trains well on one set of GPUs, CUDA_VISIBLE_DEVICES = 0,1,2,3, while it runs into OOM during the first graph evaluation with CUDA_VISIBLE_DEVICES = 0,1,4,5.
Does anyone have any idea why this is happening?
The following options are used for creating the session:
session_config = tf.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False)
session_config.gpu_options.per_process_gpu_memory_fraction = 0.94
session_config.gpu_options.allow_growth = False
The batch size is already super small: 3.
System information:
TensorFlow 1.0
CUDA 8.0
Ubuntu 14.04.5 LTS
All GPUs: GeForce GTX 1080
Logs:
name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:07:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xcc4593a0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 1 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:08:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd2404670
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 2 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:18:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:590] creating context when one is currently active; existing: 0xd25591b0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 3 with properties: name: GeForce GTX 1080 major: 6 minor: 1 memoryClockRate (GHz) 1.7335 pciBusID 0000:1c:00.0 Total memory: 7.92GiB Free memory: 7.81GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 1 2 3
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 1: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 2: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 3: Y Y Y Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:07:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX 1080, pci bus id: 0000:08:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX 1080, pci bus id: 0000:18:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:3) -> (device: 3, name: GeForce GTX 1080, pci bus id: 0000:1c:00.0)
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 47441 get requests, put_count=8461 evicted_count=1000 eviction_rate=0.118189 and unsatisfied allocation rate=0.844839
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.08GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.98GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 2.68GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
W tensorflow/core/common_runtime/bfc_allocator.cc:217] Ran out of memory trying to allocate 3.86GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 2698 get requests, put_count=8709 evicted_c
Is the log from a good run, or a bad run? There is no error there, only warnings.
If your system has a dual root complex, GPUs 0,1,4,5 could be on different partitions; the DMA matrix would show that. Copies between GPUs on the same root complex are generally faster than copies across them. If a copy has to hold the tensor reference for longer because the copy takes longer, you might see increased peak memory usage, which leads to OOM if your model is already close to the limit. Of course, this is just a theory, and without further debugging info it is difficult to tell for sure.
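One way to inspect the topology (assuming a reasonably recent NVIDIA driver) is:
nvidia-smi topo -m
This prints a matrix showing, for each GPU pair, whether traffic stays within a single PCIe switch, goes through the root complex, or has to cross CPU/NUMA boundaries.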
I've run an image processing script using the TensorFlow API. It turns out that the processing time decreased sharply when I moved the for-loop outside of the session block (creating a new session per iteration). Could anyone tell me why? Are there any side effects?
The original code:
with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    for i in range(len(file_list)):
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        # image_crop, bboxs_crop, image_debug = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        # Image._show(Image.fromarray(np.asarray(image_crop)))
        # Image._show(Image.fromarray(np.asarray(image_debug)))
        save_image(image_crop, ntpath.basename(file_list[i]))
        # save_desc_file(file_list[i], labels_list[i], bboxs_crop)
        save_desc_file(file_list[i], labels, bboxs)
    coord.request_stop()
    coord.join(threads)
The code modified:
for i in range(len(file_list)):
    with tf.Graph().as_default(), tf.Session() as sess:
        start = time.time()
        image_crop, bboxs_crop = sess.run(crop_image(file_list[i], bboxs_list[i], sess))
        print('Done image %d th in %d ms \n' % (i, ((time.time() - start) * 1000)))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)
With the original code, the time cost per image kept increasing, from 200ms to as much as 20000ms. After the modification, the log messages indicate that more than one graph and more than one TensorFlow device are created while running. Why is that?
python random_crop_images_hongyuan.py
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: GeForce GT 730M major: 3 minor: 5 memoryClockRate (GHz) 0.758 pciBusID 0000:01:00.0 Total memory: 982.88MiB Free memory: 592.44MiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3000 th in 317 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3001 th in 325 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3002 th in 312 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3003 th in 147 ms
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 730M, pci bus id: 0000:01:00.0)
Done image 3004 th in 447 ms
My guess is that this happens because creating the session is an expensive operation. Maybe it could also be that the session is not properly cleaned up when the with-statement is left, so each new allocation on the device has fewer resources available. In short, I would not recommend doing it this way; rather, initialize just one session and try to reuse it.
EDIT:
In answer to your comment: the session is closed automatically as soon as the with-block is exited. I've read in this GitHub issue that the memory on the GPU is only really released when the whole program exits. But I guess that when you allocate a new session after you have closed the last one, TensorFlow will internally just re-use the previously allocated resources. So, in retrospect, my answer is probably not very insightful. Sorry if I caused confusion.
It's not possible to be 100% certain without seeing all of your code, but I would guess that the crop_image() function is calling various TensorFlow op functions to build a graph.
It is almost never a good idea to build a graph inside a for loop. This answer explains why: some operations (such as the first Session.run() call to a new operation) take time that is linear in the number of operations in the graph. If you add more operations in each iteration, iteration i will do work that is linear in i, and so the overall execution time will be quadratic.
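To make the quadratic behaviour concrete, here is a toy sketch (TF 1.x): each iteration adds one Add node to the default graph, and the first run of each new op pays a cost that grows with the graph size:
import time
import tensorflow as tf

x = tf.constant(0.0)
with tf.Session() as sess:
    for i in range(1000):
        x = x + 1.0  # grows the graph by one op per iteration
        start = time.time()
        sess.run(x)  # first run of the new op touches the ever-larger graph
        if i % 200 == 0:
            print(i, time.time() - start)  # per-step time creeps upward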
The modified version of your code (with a with tf.Graph().as_default(): block inside the loop) will be faster because it creates a new, empty tf.Graph in each iteration, and therefore each iteration does a constant amount of work.
An even more efficient solution would be to build the graph and session once, using tf.placeholder() tensors to represent the filename and bbox arguments to crop_image, and feeding different values to these placeholders in each iteration.
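A minimal sketch of that refactor, assuming crop_image() can be adapted into a hypothetical crop_image_graph() that builds ops from placeholder inputs; filter_bbox, save_image, save_desc_file, file_list, bboxs_list, and labels_list come from the original script, and the [None, 4] bbox shape is an assumption:
import time
import ntpath
import tensorflow as tf

graph = tf.Graph()
with graph.as_default():
    filename_ph = tf.placeholder(tf.string, shape=[])       # one file path per run
    bboxs_ph = tf.placeholder(tf.float32, shape=[None, 4])  # assumed bbox layout
    image_crop_op, bboxs_crop_op = crop_image_graph(filename_ph, bboxs_ph)

with tf.Session(graph=graph) as sess:
    for i in range(len(file_list)):
        start = time.time()
        # the graph is fixed; only the fed values change per iteration
        image_crop, bboxs_crop = sess.run(
            [image_crop_op, bboxs_crop_op],
            feed_dict={filename_ph: file_list[i], bboxs_ph: bboxs_list[i]})
        print('Done image %d th in %d ms' % (i, (time.time() - start) * 1000))
        labels, bboxs = filter_bbox(labels_list[i], bboxs_crop)
        save_image(image_crop, ntpath.basename(file_list[i]))
        save_desc_file(file_list[i], labels, bboxs)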