I've got a DiscoGAN which I've built with Tensorflow Slim. The code executes on a CPU, but throws an out of memory error when trying to run it on a GPU. The stack trace shows the following error multiple times in a single execution:
Allocator (GPU_0_bfc) ran out of memory trying to allocate 288.00MiB. Current allocation summary follows.
The size in that error varies (see the full stack trace below) - sometimes it doesn't even get to 288 MB; I've seen an allocation of as little as 4 bytes fail.
I'm just testing execution right now. I have a batch size of 12 (out of 72 images in total) and each image is 64x64x3.
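(For scale, assuming float32 inputs, a batch of 12 images at 64x64x3 is only 12 * 64 * 64 * 3 * 4 bytes ≈ 0.6 MiB, so the input tensors themselves are tiny compared with the allocations named in the error.)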
Some other people who had similar errors managed to sort this problem out by adding the following to their code:
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config = config)
However this didn't do anything to resolve my problem.
Stack trace is: https://pastebin.com/qeisC6PC (too big to paste in here)
I am trying to load a relatively large batch of float16 multispectral images (BxCxHxW=800x12x256x256) to train a deep learning model. The code for the DataLoader is extremely simple:
import os
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

# Build full paths so np.load can find the files
paths = [os.path.join("/home/bla/data", p) for p in os.listdir("/home/bla/data")]

class MultiSpectralImageDataset(Dataset):
    def __init__(self, paths):
        self.paths = np.array(paths)
        self.l = len(self.paths)

    def __len__(self):
        return self.l

    def __getitem__(self, idx):
        path = self.paths[idx]
        image = np.load(path)  # one float16 multispectral image per file
        return image

dataset = MultiSpectralImageDataset(paths)
loader = DataLoader(dataset, batch_size=800, shuffle=True, pin_memory=True,
                    num_workers=16, drop_last=True)

for i, X in enumerate(loader):
    X = X.cuda(non_blocking=True).float()
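(For scale: each float16 sample is 12 * 256 * 256 * 2 bytes ≈ 1.5 MB, so a batch of 800 is roughly 1.2 GB; at the ~2 GB/s the drive benchmarks at below, the raw reads for one batch should take well under a second.)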
The images are individual files on a very fast NVME SSD. I can verify the read speed of the SSD with sudo hdparm -tT /dev/nvme1n1. This gives me:
/dev/nvme1n1:
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
readonly = 0 (off)
readahead = 256 (on)
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
geometry = 1907729/64/32, sectors = 3907029168, start = 0
bla@bla:~/workspace$ sudo hdparm -tT /dev/nvme1n1
/dev/nvme1n1:
Timing cached reads: 59938 MB in 2.00 seconds = 30041.04 MB/sec
HDIO_DRIVE_CMD(identify) failed: Inappropriate ioctl for device
Timing buffered disk reads: 6308 MB in 3.00 seconds = 2102.35 MB/sec
This confirms the read speed of the SSD is over 2 GB/s. However, when using the PyTorch DataLoader, I am nowhere near able to match this IO speed. During training, the GPU is idle (0% utilization) most of the time, and the CPU is hardly used (htop shows most cores at 0% usage, some at 0.5-1.5% usage). Running iotop shows that the Total Disk Read speed never surpasses 300 MB/s. If I decrease num_workers (say by half), the Total Disk Read stays about the same (~200 MB/s), and each individual thread doubles in read speed. In particular, I observe that every num_workers iterations, the iteration is extremely slow (it takes ~1 minute). This apparently simply means that the loading from disk is too slow, as discussed in the PyTorch forum here
What's weird is that I am 99.9% confident it used to work. I remember consistently reaching almost 100% GPU utilization with the same data-loading procedure.
Things I've tried, but with no success:
Updating Ubuntu, updating everything with apt update & upgrade, rebooting, powering off and restarting
Updating the SSD firmware using fwupd (no updates available)
Giving higher priority to the process by running Python using sudo and using os.nice(-10)
Making space on the SSD (30% of the storage is empty); I have also run fstrim -v
Using memmap, i.e. passing np.load(path, mmap_mode='r') (a sketch of this variant is shown after this list)
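For reference, the memmap variant of __getitem__ is essentially just the keyword swap below (a sketch; the explicit copy back into an ordinary in-memory array before returning is an assumption about how the data gets handed to the default collate function):
def __getitem__(self, idx):
    path = self.paths[idx]
    # mmap_mode='r' maps the .npy file lazily instead of reading it eagerly
    image = np.load(path, mmap_mode='r')
    # force the actual read and copy into RAM before collation
    return np.array(image)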
I really appreciate any help, as I've been stuck with this problem for weeks now; what used to take 13 minutes per epoch now takes approximately 1h45m per epoch, which makes training infeasible.
I am digging into Dask and (mostly) feel comfortable with it. However, I cannot understand what is going on in the following scenario. TBH, I'm sure a question like this has been asked in the past, but after searching for a while I can't seem to find one that really hits the nail on the head. So here we are!
In the code below, you can see a simple Python function with a Dask-delayed decorator on it. In my real use case this would be a "black box" type function within which I don't care what happens, so long as it stays within a 4 GB memory budget and ultimately returns a pandas dataframe. In this case I've specifically chosen the value N=1.5e8 since this results in a total memory footprint of nearly 2.2 GB (large, but still well within the budget). Finally, when executing this file as a script, I have a "data pipeline" which simply runs the black-box function for some number of IDs, and at the end builds up a result dataframe (which I could then do more stuff with).
The confusing bit comes in when this is executed. I can see that only two function calls are executed at once (which is what I would expect), but I receive the warning message distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 3.16 GiB -- Worker memory limit: 3.73 GiB, and shortly thereafter the script exits prematurely. Where is this memory usage coming from?? Note that if I increase memory_limit="8GB" (which is actually more than my computer has), then the script runs fine and my print statement informs me that the dataframe is indeed only utilizing 2.2 GB of memory
Please help me understand this behavior and, hopefully, implement a more memory-safe approach
Many thanks!
BTW:
In case it is helpful, I'm using python 3.8.8, dask 2021.4.0, and distributed 2021.4.0
I've also confirmed this behavior on a Linux (Ubuntu) machine, as well as a Mac M1. They both show the same behavior, although the Mac M1 fails for the same reason with far less memory usage (N=3e7, or roughly 500 MB)
import time
import pandas as pd
import numpy as np
from dask.distributed import LocalCluster, Client
import dask

@dask.delayed
def do_pandas_thing(id):
    print(f"STARTING: {id}")
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    print(
        f"df memory usage {df.memory_usage().sum()/(2**30):.3f} GB",
    )
    # Simulate a "long" computation
    time.sleep(5)
    return df.iloc[[-1]]  # return the last row

if __name__ == "__main__":
    cluster = LocalCluster(
        n_workers=2,
        memory_limit="4GB",
        threads_per_worker=1,
        processes=True,
    )
    client = Client(cluster)

    # Evaluate "black box" functions with pandas inside
    results = []
    for i in range(10):
        results.append(do_pandas_thing(i))

    # compute
    r = dask.compute(results)[0]
    print(pd.concat(r, ignore_index=True))
I am unable to reproduce the warning/error with the following versions:
pandas=1.2.4
dask=2021.4.1
python=3.8.8
When the object size increases, the process does crash due to memory, but in general it's a good idea to keep each workload to a fraction of the available memory:
To put it simply, we weren't thinking about analyzing 100 GB or 1 TB datasets in 2011. Nowadays, my rule of thumb for pandas is that you should have 5 to 10 times as much RAM as the size of your dataset. So if you have a 10 GB dataset, you should really have about 64, preferably 128 GB of RAM if you want to avoid memory management problems. This comes as a shock to users who expect to be able to analyze datasets that are within a factor of 2 or 3 the size of their computer's RAM.
source
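As a rough illustration of that advice (a sketch only, using a hypothetical do_pandas_thing_lean; it does not explain where the extra memory in the warning comes from): if only the last row is needed downstream, the intermediate frame can be dropped before the task returns, which helps keep each worker's resident memory lower between tasks.
import gc
import time
import numpy as np
import pandas as pd
import dask

@dask.delayed
def do_pandas_thing_lean(id):
    N = 1.5e8
    df = pd.DataFrame({"a": np.arange(N), "b": np.arange(N)})
    last_row = df.iloc[[-1]].copy()  # keep only what the caller actually needs
    del df                           # drop the ~2.2 GB frame before returning
    gc.collect()                     # encourage the worker process to release memory
    time.sleep(5)
    return last_row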
I am using dask (2021.3.0) and rapids (0.18) in my project. In it, I perform the preprocessing task on the CPU, and then the preprocessed data is transferred to the GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(the error is raised before the GPU memory is completely used, i.e. it is not using the GPU memory fully)
I have a single GPU with 40 GB of memory.
The RAM size is 512 GB.
I am using following snippet of code:
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
## perform my preprocessing on the data and get the output in variable A
# convert the A variable to cupy
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and whenever the GPU memory spills over, the spilled data is moved to a temp directory or to the CPU (as we do with dask, where we define a temp directory for spills from RAM).
Any help will be appreciated.
There are several ways to work with datasets larger than GPU memory.
Check out Nick Becker's blog, which has a few methods well documented
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
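For the spilling part of the question specifically, one option (a sketch, assuming the dask-cuda package that ships alongside RAPIDS is installed) is to start the workers with dask_cuda's LocalCUDACluster, which can spill device memory to host memory once a per-worker threshold is crossed, much like a plain LocalCluster spills from RAM to a temp directory:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU; data held on the device beyond device_memory_limit
# is spilled to host RAM instead of raising an out-of-memory error.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="30GB")
client = Client(cluster)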
I am trying to use a quantized model with TensorFlow Lite Micro, and got a segmentation fault inside the interpreter->Invoke() call.
The debugger showed that the segmentation fault occurred on returning from Eval() in conv.cc at Node 28 of CONV_2D, and the stack was corrupted. With the compiler flags "-fstack-protector-all -Wstack-protector", the error message is *** stack smashing detected ***: <unknown> terminated.
My test is simply the person detection example with the model replaced by Mobilenet_V1_0.25_224_quant from the TensorFlow Lite pre-trained models site, with kTensorArenaSize increased sufficiently, the model input/output sizes changed to 224x224x3 and 1x1001, and the additional required operators pulled in.
I also tried a few different models: another quantized model, Mobilenet_V1_0.25_192_quant, shows the same segfault problem,
but the regular floating-point models Mobilenet_V1_0.25_192 and Mobilenet_V1_0.25_224 run OK over many loops.
Has anyone seen a similar problem? Or are there limitations in TensorFlow Lite Micro that I should be aware of?
This problem can be reproduced at this commit of forked tensorflow repo.
Build command:
$ bazel build //tensorflow/lite/micro/examples/person_detection:person_detection -c dbg --copt=-fstack-protector-all --copt=-Wstack-protector --copt=-fno-omit-frame-pointer
And run:
$ ./bazel-bin/tensorflow/lite/micro/examples/person_detection/person_detection
Files changed:
tensorflow/lite/micro/examples/person_detection/main_functions.cc
tensorflow/lite/micro/examples/person_detection/model_settings.h
tensorflow/lite/micro/examples/person_detection/person_detect_model_data.cc
Changes in main_functions.cc:
constexpr int kTensorArenaSize = 1400 * 1024;
static tflite::MicroOpResolver<5> micro_op_resolver;
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_RESHAPE,
                             tflite::ops::micro::Register_RESHAPE());
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
                             tflite::ops::micro::Register_SOFTMAX(), 1, 2);
Changes in model_settings.h
constexpr int kNumCols = 224;
constexpr int kNumRows = 224;
constexpr int kNumChannels = 3;
constexpr int kCategoryCount = 1001;
The last file changed, person_detect_model_data.cc, is pretty big; please see the full file on GitHub.
March 28, 2020:
Also tested on a Raspberry Pi 3; the results are the same as on x86 Ubuntu 18.04.
pi@raspberrypi:~/tests $ ./person_detection
*** stack smashing detected ***: <unknown> terminated
Aborted
Thanks for your help.
Problem root cause found - Updated on April 2, 2020:
I found that the problem is caused by an array overrun in the layer operation data. TensorFlow Lite Micro has a hidden limit (or one I missed in the documentation; at least the TF Micro runtime does not check it) of a maximum of 256 output channels, defined in the OpData structure of conv.cc.
constexpr int kMaxChannels = 256;
....
struct OpData {
...
// Per channel output multiplier and shift.
// TODO(b/141139247): Allocate these dynamically when possible.
int32_t per_channel_output_multiplier[kMaxChannels];
int32_t per_channel_output_shift[kMaxChannels];
...
}
The MobileNet model Mobilenet_V1_0.25_224_quant.tflite has 1000 output classes, and 1001 channels in total internally. This causes stack corruption in tflite::PopulateConvolutionQuantizationParams() at tensorflow/lite/kernels/kernel_util.cc:90 for the last Conv2D, whose output size is 1001.
There is no problem for TF and TF Lite, as they are believed not to use this structure definition.
Confirmed by increasing the channel limit to 1024 and running loops of model evaluation calls.
Most TF Lite Micro use cases likely involve small models, though, and probably won't run into this problem.
Perhaps this limit should be better documented and/or checked at run time?
TensorFlow tends to preallocate the entire available memory on its GPUs. For debugging, is there a way of telling how much of that memory is actually in use?
(1) There is some limited support with Timeline for logging memory allocations. Here is an example of its usage:
from tensorflow.python.client import timeline  # needed for timeline.Timeline below

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
summary, _ = sess.run([merged, train_step],
                      feed_dict=feed_dict(True),
                      options=run_options,
                      run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%03d' % i)
train_writer.add_summary(summary, i)
print('Adding run metadata for', i)
tl = timeline.Timeline(run_metadata.step_stats)
print(tl.generate_chrome_trace_format(show_memory=True))
trace_file = tf.gfile.Open(name='timeline', mode='w')
trace_file.write(tl.generate_chrome_trace_format(show_memory=True))
You can give this code a try with the MNIST example (mnist with summaries)
This will generate a tracing file named timeline, which you can open with chrome://tracing. Note that this only gives approximate GPU memory usage statistics. It basically simulates a GPU execution, but doesn't have access to the full graph metadata. It also can't know how many variables have been assigned to the GPU.
(2) For a very coarse measure of GPU memory usage, nvidia-smi will show the total device memory usage at the time you run the command.
nvprof can show the on-chip shared memory usage and register usage at the CUDA kernel level, but doesn't show the global/device memory usage.
Here is an example command: nvprof --print-gpu-trace matrixMul
And more details here:
http://docs.nvidia.com/cuda/profiler-users-guide/#abstract
Here's a practical solution that worked well for me:
Disable GPU memory pre-allocation using TF session configuration:
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
Run nvidia-smi -l (or some other utility) to monitor GPU memory consumption.
Step through your code with the debugger until you see the unexpected GPU memory consumption.
There's some code in tensorflow.contrib.memory_stats that will help with this:
import tensorflow as tf
from tensorflow.contrib.memory_stats.python.ops.memory_stats_ops import BytesInUse

with tf.device('/device:GPU:0'):  # Replace with the device you are interested in
    bytes_in_use = BytesInUse()

with tf.Session() as sess:
    print(sess.run(bytes_in_use))
The TensorFlow profiler has an improved memory timeline that is based on real GPU memory allocator information:
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/profiler#visualize-time-and-memory
tf.config.experimental.get_memory_info('GPU:0')
Currently returns the following keys:
'current': The current memory used by the device, in bytes.
'peak': The peak memory used by the device across the run of the program, in bytes.
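A minimal usage sketch (assuming TF 2.5+ and a GPU registered as 'GPU:0'):
import tensorflow as tf

info = tf.config.experimental.get_memory_info('GPU:0')
print(f"current: {info['current'] / 2**20:.1f} MiB")  # memory in use right now
print(f"peak:    {info['peak'] / 2**20:.1f} MiB")     # high-water mark for this run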
As @V.M previously mentioned, a solution that works well is using: tf.config.experimental.get_memory_info('DEVICE_NAME')
This function returns a dictionary with two keys:
'current': The current memory used by the device, in bytes
'peak': The peak memory used by the device across the run of the program, in bytes.
The values of these keys are the ACTUAL memory used, not the memory allocated, which is what nvidia-smi reports.
In reality, for GPUs, TensorFlow allocates all of the memory by default, which makes nvidia-smi useless for checking the memory actually used by your code. Even if tf.config.experimental.set_memory_growth is set to true, TensorFlow no longer allocates the whole available memory, but it still allocates more memory than is actually used, and in discrete steps, e.g. it allocates 4589 MiB, then 8717 MiB, then 16943 MiB, then 30651 MiB, etc.
A small note concerning get_memory_info(): it doesn't return correct values if called inside a tf.function()-decorated function. Thus, the 'peak' key should be read after the tf.function()-decorated function has executed, to determine the peak memory used.
For older versions of Tensorflow, tf.config.experimental.get_memory_usage('DEVICE_NAME') was the only available function and only returned the used memory (no option for determining the peak memory).
Final note: you can also consider the TensorFlow Profiler available with TensorBoard, as @Peter mentioned.
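For reference, a minimal sketch of capturing a profile that TensorBoard's profiler (including its memory breakdown) can read, assuming TF 2.2+ and a writable log directory:
import tensorflow as tf

tf.profiler.experimental.start("logdir")   # begin capturing a profile
# ... run the training / inference steps you want to inspect ...
tf.profiler.experimental.stop()            # write the trace to "logdir"
# Then view it with: tensorboard --logdir logdir  (Profile tab)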
Hope this helps :)