How to use TCMalloc on Google Cloud ML Engine - tcmalloc

How can I use TCMalloc on Google Cloud ML Engine? Or, apart from TCMalloc, is there any other way to solve memory-leak issues on ML Engine?
Finalizing the graph doesn't seem to help.
Memory utilization graph:
I got an out-of-memory error after training for 73 epochs. Here is part of the training log:
11:26:33.707
Job failed.
11:26:20.949
Finished tearing down TensorFlow.
11:25:18.568
The replica master 0 ran out-of-memory and exited with a non-zero status of 247. To find out more about why your job exited please check the logs
11:25:07.785
Clean up finished.
11:25:07.785
Module completed; cleaning up.
11:25:07.783
Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.main', u'--data=gs://', u'--train_log_dir=gs://tfoutput/joboutput', u'--model=trainer.crisp_model', u'--num_threads=32', u'--memory_usage=0.8', u'--max_out_norm=1', u'--train_batch_size=64', u'--sample_size=112', u'--num_gpus=4', u'--allow_growth=True', u'--weight_loss_by_train_size=True', u'-x', returned non-zero exit status -9.
11:23:08.853
PNG warning: Exceeded size limit while expanding chunk
11:18:18.474
epoch 58.0: accuracy = 0.9109
11:17:14.851
2017-05-17 10:17:14.851024: epoch 58, loss = 0.12, lr = 0.085500 (228.9 examples/sec; 0.280 sec/batch)
11:15:39.532
PNG warning: Exceeded size limit while expanding chunk
11:10:23.855
PoolAllocator: After 372618242 get requests, put_count=372618151 evicted_count=475000 eviction_rate=0.00127476 and unsatisfied allocation rate=0.00127518
11:05:32.928
PNG warning: Exceeded size limit while expanding chunk
10:59:26.006
epoch 57.0: accuracy = 0.8868
10:58:24.117
2017-05-17 09:58:24.117444: epoch 57, loss = 0.23, lr = 0.085750 (282.2 examples/sec; 0.227 sec/batch)
10:54:37.440
PNG warning: Exceeded size limit while expanding chunk
10:53:30.323
PoolAllocator: After 366350973 get requests, put_count=366350992 evicted_count=465000 eviction_rate=0.00126927 and unsatisfied allocation rate=0.0012694
10:51:51.417
PNG warning: Exceeded size limit while expanding chunk
10:40:43.811
epoch 56.0: accuracy = 0.7897
10:39:41.308
2017-05-17 09:39:41.308624: epoch 56, loss = 0.06, lr = 0.086000 (273.8 examples/sec; 0.234 sec/batch)
10:38:14.522
PoolAllocator: After 360630699 get requests, put_count=360630659 evicted_count=455000 eviction_rate=0.00126168 and unsatisfied allocation rate=0.00126197
10:36:10.480
PNG warning: Exceeded size limit while expanding chunk
10:21:50.715
epoch 55.0: accuracy = 0.9175
10:20:51.801
PoolAllocator: After 354197216 get requests, put_count=354197255 evicted_count=445000 eviction_rate=0.00125636 and unsatisfied allocation rate=0.00125644
10:20:49.815
2017-05-17 09:20:49.815251: epoch 55, loss = 0.25, lr = 0.086250 (285.6 examples/sec; 0.224 sec/batch)
10:02:56.637
epoch 54.0: accuracy = 0.9191
10:01:57.367
2017-05-17 09:01:57.367369: epoch 54, loss = 0.09, lr = 0.086500 (256.5 examples/sec; 0.249 sec/batch)
10:01:42.365
PoolAllocator: After 347107694 get requests, put_count=347107646 evicted_count=435000 eviction_rate=0.00125321 and unsatisfied allocation rate=0.00125354
09:45:56.116
PNG warning: Exceeded size limit while expanding chunk
09:44:12.698
epoch 53.0: accuracy = 0.9039
09:43:09.888
2017-05-17 08:43:09.888202: epoch 53, loss = 0.10, lr = 0.086750 (307.0 examples/sec; 0.208 sec/batch)
09:41:48.672
PoolAllocator: After 339747205 get requests, put_count=339747210 evicted_count=425000 eviction_rate=0.00125093 and unsatisfied allocation rate=0.00125111
09:36:14.085
PNG warning: Exceeded size limit while expanding chunk
09:35:11.686
PNG warning: Exceeded size limit while expanding chunk
09:34:45.011
PNG warning: Exceeded size limit while expanding chunk
09:31:03.212
PNG warning: Exceeded size limit while expanding chunk
09:28:40.116
PoolAllocator: After 335014430 get requests, put_count=335014342 evicted_count=415000 eviction_rate=0.00123875 and unsatisfied allocation rate=0.00123921
09:27:38.374
PNG warning: Exceeded size limit while expanding chunk
09:25:23.913
PNG warning: Exceeded size limit while expanding chunk
09:25:16.065
epoch 52.0: accuracy = 0.9313
09:24:16.963
2017-05-17 08:24:16.962930: epoch 52, loss = 0.11, lr = 0.087000 (278.7 examples/sec; 0.230 sec/batch)
09:17:48.417
PNG warning: Exceeded size limit while expanding chunk
09:13:34.740
PoolAllocator: After 329380055 get requests, put_count=329379978 evicted_count=405000 eviction_rate=0.00122958 and unsatisfied allocation rate=0.00123001
09:06:09.948
update epoch 51.0: accuracy = 0.9357
09:06:09.948
epoch 51.0: accuracy = 0.9357
09:05:09.575
2017-05-17 08:05:09.575641: epoch 51, loss = 0.11, lr = 0.087250 (248.4 examples/sec; 0.258 sec/batch)
08:59:17.735
PNG warning: Exceeded size limit while expanding chunk
08:55:58.605
PoolAllocator: After 322904781 get requests, put_count=322904714 evicted_count=395000 eviction_rate=0.00122327 and unsatisfied allocation rate=0.00122368
08:48:46.322
PNG warning: Exceeded size limit while expanding chunk
08:47:27.936
epoch 50.0: accuracy = 0.9197
08:46:29.370
2017-05-17 07:46:29.370135: epoch 50, loss = 0.20, lr = 0.087500 (253.2 examples/sec; 0.253 sec/batch)
I've tried using TCMalloc for training on my local machine; there is still a memory leak, but it is smaller than without TCMalloc.

TensorFlow uses jemalloc by default, and that is what is used on CloudML Engine as well:
jemalloc is a general purpose malloc(3) implementation that emphasizes
fragmentation avoidance and scalable concurrency support.
So fragmentation is not likely the root cause of your memory issues.
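If you still want to experiment with TCMalloc, one approach is to have the trainer re-exec itself with LD_PRELOAD pointing at libtcmalloc before TensorFlow allocates anything. This is only a sketch, under the assumption that gperftools/libtcmalloc is available on the worker image; the library path and the main() entry point below are placeholders:
import os
import sys

# Hypothetical path; depends on how gperftools/libtcmalloc is installed on the worker image.
TCMALLOC_PATH = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"

def main():
    # ... your actual training code (placeholder) ...
    pass

if __name__ == "__main__":
    if os.path.exists(TCMALLOC_PATH) and "LD_PRELOAD" not in os.environ:
        # Re-exec the current Python process so libtcmalloc is preloaded
        # before TensorFlow makes any allocations. If the trainer is launched
        # with "python -m", you may need to rebuild the -m invocation instead
        # of re-running sys.argv as a script.
        env = dict(os.environ, LD_PRELOAD=TCMALLOC_PATH)
        os.execve(sys.executable, [sys.executable] + sys.argv, env)
    main()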

Related

How to interpret the TensorFlow Lite benchmark app results?

I have used the TensorFlow Lite benchmark app, and I got the following result:
E tflite : Average inference timings in us: Warmup: 119501, Init: 3556, Inference: 135968, Overall max resident set size = 32.0469 MB, total malloc-ed size = 0 MB, in-use allocated/mmapped size = 13.3229 MB
I would like to know what these values mean: Warmup, Init, Inference, Overall max resident set size, total malloc-ed size, and in-use allocated/mmapped size.
I didn't find them in the documentation.
For the latencies:
Warmup: The latency of the first warmup invocation. The very first invocation may be slower because the code may do extra initialization/allocation.
Init: Initialization time (to create the TensorFlow Lite interpreter).
Inference: The average latency of the inference invocations. This should be the most important metric in most cases.
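As a rough illustration of what those latency numbers correspond to, here is a minimal Python sketch that times a tf.lite.Interpreter in the same way (the model path and the number of runs are placeholders):
import time
import numpy as np
import tensorflow as tf

t0 = time.perf_counter()
interpreter = tf.lite.Interpreter(model_path="model.tflite")  # placeholder path
interpreter.allocate_tensors()
init_us = (time.perf_counter() - t0) * 1e6                    # roughly the "Init" phase

inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))

t0 = time.perf_counter()
interpreter.invoke()                                          # "Warmup": first invocation, often slower
warmup_us = (time.perf_counter() - t0) * 1e6

times = []
for _ in range(50):                                           # "Inference": average over repeated invocations
    t0 = time.perf_counter()
    interpreter.invoke()
    times.append((time.perf_counter() - t0) * 1e6)

print("Init: %.0f us, Warmup: %.0f us, Inference (avg): %.0f us"
      % (init_us, warmup_us, sum(times) / len(times)))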

Issue in Yolov4 training with backup weights

I am trying to train YOLOv4 in Colab using already saved weights, but the training process stops abruptly after loading the weights. Below is the log; after "Create 6 permanent cpu-threads", execution stops:
[yolo] params: iou loss: ciou (4), iou_norm: 0.07, cls_norm: 1.00, scale_x_y: 1.05
nms_kind: greedynms (1), beta = 0.600000
Total BFLOPS 59.563
avg_outputs = 489778
Allocate additional workspace_size = 52.43 MB
Loading weights from
/content/drive/MyDrive/YOLOv4_weight/backup/yolov4_custom_train_final.weights...
seen 64, trained: 192 K-images (3 Kilo-batches_64)
Done! Loaded 162 layers from weights-file
Learning Rate: 0.01, Momentum: 0.949, Decay: 0.0005
Detection layer: 139 - type = 27
Detection layer: 150 - type = 27
Detection layer: 161 - type = 27
Saving weights to backup/yolov4_custom_train_final.weights
Create 6 permanent cpu-threads
The weights I am trying to use were generated after 3000 epochs, with yolov4.conv.137 as the initial weights; since the accuracy was low, I want to train for 5000 epochs with the saved weights.

OOM - cannot run StyleGAN2 despite reducing batch size

I am trying to run StyleGAN2 using a cluster equipped with eight GPUs (NVIDIA GeForce RTX 2080). At present, I am using the following configuration in training_loop.py:
minibatch_size_dict = {4: 512, 8: 256, 16: 128, 32: 64, 64: 32}, # Resolution-specific overrides.
minibatch_gpu_base = 8, # Number of samples processed at a time by one GPU.
minibatch_gpu_dict = {}, # Resolution-specific overrides.
G_lrate_base = 0.001, # Learning rate for the generator.
G_lrate_dict = {}, # Resolution-specific overrides.
D_lrate_base = 0.001, # Learning rate for the discriminator.
D_lrate_dict = {}, # Resolution-specific overrides.
lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
tick_kimg_base = 4, # Default interval of progress snapshots.
tick_kimg_dict = {4:10, 8:10, 16:10, 32:10, 64:10, 128:8, 256:6, 512:4}): # Resolution-specific overrides.
I am training using a set of 512x512-pixel images. After a couple of iterations, I get the error message reported below and it looks like the script stops running (using watch nvidia-smi, I can see that both the temperature and the fan activity of the GPUs decrease). I already reduced the batch size, but it looks like the problem is somewhere else. Do you have any tips on how to fix this?
I was able to run StyleGAN with the same dataset. In the paper they say that StyleGAN2 should be lighter, so I am a bit surprised.
Here is the error message I get:
2019-12-16 18:22:54.909009: E tensorflow/stream_executor/cuda/cuda_driver.cc:828] failed to allocate 334.11M (350338048 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2019-12-16 18:22:54.909087: W tensorflow/core/common_runtime/bfc_allocator.cc:314] Allocator (GPU_0_bfc) ran out of memory trying to allocate 129.00MiB (rounded to 135268352). Current allocation summary follows.
2019-12-16 18:22:54.918750: W tensorflow/core/common_runtime/bfc_allocator.cc:319] **_***************************_*****x****x******xx***_******************************_***************
2019-12-16 18:22:54.918808: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at conv_grad_input_ops.cc:903 : Resource exhausted: OOM when allocating tensor with shape[4,128,257,257] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
The config-f model for StyleGAN2 is actually bigger than StyleGAN1. Try a configuration that consumes less VRAM, such as config-e. You can change the model configuration by passing a flag on your Python command line, as shown here: https://github.com/NVlabs/stylegan2/blob/master/run_training.py#L144
In my case, I'm able to train StyleGAN2 with config-e on two RTX 2080 Ti cards.
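For example, an invocation along these lines (the dataset and data-dir values are placeholders; check run_training.py in the linked repository for the exact flag names and defaults):
python run_training.py --num-gpus=8 --config=config-e --data-dir=datasets --dataset=my_dataset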
One or more high-end NVIDIA GPUs, NVIDIA drivers, CUDA 10.0 toolkit
and cuDNN 7.5. To reproduce the results reported in the paper, you
need an NVIDIA GPU with at least 16 GB of DRAM.
Your NVIDIA GeForce RTX 2080 card has 11 GB, but I guess you're saying you have 8 of them? I don't think TensorFlow is set up for parallelism out of the box.

Understand the OOM mechanism of tensorflow

I am using GloVe pre-trained embeddings to train my own network. I use
self.embedding = tf.get_variable(name="embedding", shape=self.id2vec_table.shape, initializer=tf.constant_initializer(self.id2vec_table), trainable=False)
and tuning_embedding = tf.nn.embedding_lookup(self.embedding, self.txt_from_mfcc)
to initialize and look up the embedding. However, when I run training, the error shows as follows (the error message is too long, so I include here only the parts I believe are most important):
Sum Total of in-use chunks: 3.85GiB
Limit: 11281927373
InUse: 4131524096
MaxInUse: 6826330624
NumAllocs: 47061
MaxAllocSize: 2842165248
OP_REQUIRES failed at matmul_op.cc:478 : Resource exhausted: OOM when allocating tensor with shape[4800,400001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
However, according to the error stats, the maximum memory of my Tesla K80 is 11 GB, and only about 40%-70% of it (around 4-7 GB) is in use, so how can my GPU be out of memory if it uses at most 70% of the total? I just cannot understand the inner mechanism of how this works.
I have also tried methods from other posts, such as
https://stackoverflow.com/questions/42495930/tensorflow-oom-on-gpu
including limiting my batch size to 16, config.gpu_options.allow_growth = True, config.gpu_options.allocator_type = 'BFC', and config.gpu_options.per_process_gpu_memory_fraction = 0.4, but the error is still there.
Any help here?
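For reference, the session options mentioned above are typically set like this in TF1 (a minimal sketch; the 0.4 fraction is just the value tried above, and normally you would set either allow_growth or a memory fraction, not both):
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                      # grow GPU memory as needed instead of reserving it all
config.gpu_options.allocator_type = 'BFC'                   # best-fit-with-coalescing allocator
config.gpu_options.per_process_gpu_memory_fraction = 0.4    # cap at 40% of GPU memory

with tf.Session(config=config) as sess:
    # ... build the graph and run training here (placeholder) ...
    pass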

TensorFlow out of memory after a few iterations

I am running a deep-learning project on a large dataset with a GPU. Each image is about 200*200*200 voxels.
During training, I receive warnings and out-of-memory errors at different iterations; sometimes my program ends at the first iteration because of an out-of-memory error, but sometimes it ends after training for hundreds of iterations for the same reason.
So I am wondering: if it can be trained and has already run some iterations, why does the out-of-memory error still occur? I was not running other programs on that GPU, and the batch size is fixed. Could someone please help me fix it or provide some ideas about how to deal with it?
Some details:
TensorFlow always warns like this:
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 13278 get requests, put_count=13270 evicted_count=1000 eviction_rate=0.075358 and unsatisfied allocation rate=0.0834463
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
and after hundreds of iterations, the program is stopped by a memory error (part of the output):
...
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288a7e00 of size 17408
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288ac200 of size 17408
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b0600 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b2100 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b3c00 of size 6912
I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x33288b5700 of size 6912
...
...
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 648.00MiB. See logs for memory state.
Some operations I used: tf.nn.conv3d/tf.nn.conv3d_transpose/tf.nn.batch_normalization