Out of memory at every second trial using Ray Tune - tensorflow

I am tuning hyperparameters using Ray Tune. The model is built with TensorFlow and occupies a large part of the available GPU memory. I noticed that every second trial reports an out-of-memory error. It looks like the memory is being freed; you can see this in the GPU memory usage graph at the moment between consecutive trials, which is where the OOM error occurred. I should add that I do not encounter this error with smaller models, and the graph looks the same.
How do I deal with this out-of-memory error in every second trial?

There's actually a utility that helps avoid this:
https://docs.ray.io/en/master/tune/api_docs/trainable.html#ray.tune.utils.wait_for_gpu
def tune_func(config):
    tune.utils.wait_for_gpu()
    train()

tune.run(tune_func, resources_per_trial={"gpu": 1}, num_samples=10)
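If the trials build TensorFlow 2.x models, a complementary mitigation (a sketch under that assumption; train_model is a hypothetical placeholder for your training code, not part of the original answer) is to enable GPU memory growth inside the trainable, so each trial allocates GPU memory incrementally instead of reserving the whole device up front:

import tensorflow as tf
from ray import tune

def tune_func(config):
    # Block until the previous trial's GPU memory has actually been released.
    tune.utils.wait_for_gpu()
    # Ask TensorFlow to grow GPU memory on demand (TF 2.x API).
    for gpu in tf.config.list_physical_devices("GPU"):
        tf.config.experimental.set_memory_growth(gpu, True)
    train_model(config)  # hypothetical placeholder for the actual training loop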

Related

TensorFlow Serving and serving more models than the memory can allow

TensorFlow Serving can serve multiple models by configuring the --model_config_file command line argument. I had success using this feature in small experiments.
However, it's unclear to me what happens when the total memory required by these models is larger than, say, the available GPU memory.
Does the server just crash? Or does it support keeping a subset of models available and possibly unloading/loading models based on the usage?
Thanks.
Trying to load a model when you are out of memory will fail to load that model. There's no dynamic loading/unloading at this time.
As currently written, it will crash if there isn't enough memory for all of the models requested to load. Internally there is a feature to gracefully decline to load a model that doesn't fit, which you could enable by writing a small PR that pipes the ServerCore::Options::total_model_memory_limit_bytes option [1] to a flag in main.cc. Note, however, that the notion of "fitting in memory" is based on a somewhat crude way of estimating model RAM footprint.
As Gautam said, it does not dynamically load/unload, although there is a library implemented for that (which isn't currently used in the released binary), called CachingManager [2].
[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/server_core.h#L112
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/caching_manager.h
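For reference, a multi-model setup passed through --model_config_file is typically a text-format config along these lines (a sketch; the model names and base paths are placeholders):

model_config_list {
  config {
    name: "model_a"                # arbitrary name clients use in requests
    base_path: "/models/model_a"   # directory containing versioned SavedModels
    model_platform: "tensorflow"
  }
  config {
    name: "model_b"
    base_path: "/models/model_b"
    model_platform: "tensorflow"
  }
}

Every model listed in this file is loaded and kept in memory, which is why the total footprint matters as discussed above.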

Optimizing Tensorflow for a 32-cores computer

I'm running TensorFlow code on an Intel Xeon machine with two physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite new to TensorFlow and I haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them and that the bottleneck is somewhere else (for example, slow access to the hard disk)?
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is a slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on the outputs of the previous one. When a single operation runs on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by memory bandwidth.
A single session.run() call may take little time, so you end up spending most of your time going back and forth between Python and the execution engine.
You can find useful suggestions here.
Use the timeline to see what is executed when; a sketch of how to capture one follows.
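Here is a rough TF 1.x-style sketch (train_op is a hypothetical placeholder; the thread-pool sizes are examples, not recommendations) showing how to set the thread pools explicitly and how to capture a timeline trace:

import tensorflow as tf
from tensorflow.python.client import timeline

# Optional: pin the sizes of TensorFlow's thread pools explicitly.
config = tf.ConfigProto(intra_op_parallelism_threads=16,
                        inter_op_parallelism_threads=16)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # Trace one step and dump a Chrome-trace timeline to inspect per-op timing.
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()
    sess.run(train_op, options=run_options, run_metadata=run_metadata)  # train_op: your training op
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())

Open timeline.json in chrome://tracing to see which ops dominate each step and whether the input pipeline is the bottleneck.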

Google Cloud ML Engine "out-of-memory" error when Memory utilization is nearly zero

I am following the Tensorflow Object Detection API tutorial to train a Faster R-CNN model on my own dataset on Google Cloud. But the following "ran out-of-memory" error kept happening.
The replica master 0 ran out-of-memory and exited with a non-zero status of 247.
And according to the logs, a non-zero exit status of -9 was returned. As described in the official documentation, a code of -9 might mean the training is using more memory than allocated.
However, the memory utilization is lower than 0.2, so why am I having a memory problem? If it helps, the memory utilization graph is here.
The memory utilization graph is an average across all workers. In the case of an out of memory error, it's also not guaranteed that the final data points are successfully exported (e.g., a huge sudden spike in memory). We are taking steps to make the memory utilization graphs more useful.
If you are using the Master to also do evaluation (as exemplified in most of the samples), then the Master uses ~2x the RAM relative to a normal worker. You might consider using the large_model machine type.
The running_pets tutorial uses the BASIC_GPU tier, so perhaps the GPU has run out of memory.
The graphs on ML Engine currently only show CPU memory utilization.
If this is the case, changing your tier to one with larger GPUs will solve the problem. Here is some information about the different tiers.
On the same page, you will find an example of a yaml file showing how to configure it.
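As an illustration (a sketch based on the legacy Cloud ML Engine machine types; check the tiers page linked above for what is actually available), a custom tier with a GPU-equipped machine would be configured in config.yaml roughly like this:

trainingInput:
  scaleTier: CUSTOM
  # a GPU-equipped machine type; pick one from the tiers page linked above
  masterType: complex_model_m_gpu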
Looking at your error, it seems that your ML code is consuming more memory than it was originally allocated.
Try a machine type that gives you more memory, such as "large_model" or "complex_model_l". Use a config.yaml to define it as follows:
trainingInput:
  scaleTier: CUSTOM
  # 'large_model' for bigger model with lots of data
  masterType: large_model
  runtimeVersion: "1.4"
There is a similar question, Google Cloud machine learning out of memory. Please refer to that link for the actual solution.

Why should preprocessing be done on CPU rather than GPU?

The performance guide advises doing the preprocessing on the CPU rather than on the GPU. The listed reasons are:
This prevents the data from going from CPU to GPU to CPU to GPU back again.
This frees the GPU of these tasks to focus on training.
I am not sure I understand either argument.
Why would preprocessing send the result back to the CPU, especially if all nodes are on the GPU? Why preprocessing operations and not any other operation in the graph? Why are they, or why should they be, special?
Even though I understand the rationale behind putting the CPU to work rather than keeping it idle, compared to the huge convolutions and gradient backpropagation a training step has to do, I would have assumed that random cropping, flipping and other standard preprocessing steps on the input images would be nowhere near that in terms of computational needs and would execute in a fraction of the time. Even if we think of preprocessing as mostly moving things around (crops, flips), I would think GPU memory should be faster for that. Yet doing preprocessing on the CPU can yield a 6+-fold increase in throughput according to the same guide.
I am assuming of course that preprocessing does not result in a drastic decrease in size of the data (e.g. supersampling or cropping to a much smaller size), in which case the gain in transfer time to the device is obvious. I suppose these are rather extreme cases and do not constitute the basis for the above recommendation.
Can somebody make sense out of this?
It comes down to how the CPU and GPU work. A GPU is very good at repetitive, parallelised tasks, whereas a CPU is better suited to other computations that require more general processing capabilities.
For example, consider a program that accepts two integers from the user and runs a for-loop one million times to sum the two numbers.
How can we achieve this with a combination of CPU and GPU processing?
We handle the initial data (the two user-input integers) on the CPU and then send the two numbers to the GPU, where the for-loop that sums them runs, because that is the repetitive, parallelizable yet simple part of the computation the GPU is better at. [Although this example isn't really about TensorFlow, the concept is at the heart of all CPU and GPU processing. Regarding your query: operations like random cropping, flipping and other standard preprocessing steps on the input images might not be computationally intensive, but the GPU doesn't excel at that kind of interrupt-related computation either.]
Another thing to keep in mind is that the latency between CPU and GPU also plays a key role here. Copying and transferring data to and from the GPU is expensive compared to transferring data between different cache levels inside the CPU.
As Dey, 2014 [1] has mentioned:
When a parallelized program is computed on the GPGPU, first the data is copied from the memory to the GPU and after computation the data is written back to the memory from the GPU using the PCI-e bus (refer to Fig. 1.8). Thus for every computation, data has to be copied to and from device-host memory. Although the computation is very fast on the GPGPU, because of the gap between device and host memory due to communication via PCI-e, a bottleneck in performance is generated.
For this reason it is advisable that you do the preprocessing on the CPU: the CPU does the initial computation, then prepares and sends the repetitive, parallelised tasks to the GPU for further processing.
I once developed a buffer mechanism to increase the data throughput between CPU and GPU, and thereby reduce the negative effects of the latency between CPU and GPU. Have a look at this thesis to gain a better understanding of the issue:
EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU)
Now, to answer your question:
Why would preprocessing send the result back to the CPU, esp. if all nodes are on GPU?
As quoted from the TensorFlow performance guide [2]:
When preprocessing occurs on the GPU the flow of data is CPU -> GPU (preprocessing) -> CPU -> GPU (training). The data is bounced back and forth between the CPU and GPU.
If you remember the data-flow diagram between the CPU, memory and GPU mentioned above, the reason doing the preprocessing on the CPU improves performance is this: after the nodes are computed on the GPU, the data is sent back to memory and the CPU fetches that memory for further processing. The GPU does not have enough memory on board (on the GPU itself) to keep all the data on it for computational purposes, so this back-and-forth of data is inevitable. To optimise this data flow, you do the preprocessing on the CPU; the data prepared for the parallelizable training tasks is then sent to memory, and the GPU fetches that preprocessed data and works on it.
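In TensorFlow terms, this usually means building the input pipeline under an explicit CPU device scope so that only the training ops land on the GPU. A minimal TF 1.x-style sketch (filenames and parse_and_preprocess are hypothetical placeholders):

import tensorflow as tf

with tf.device('/cpu:0'):
    # Read and preprocess on the CPU; only the prepared batches go to the GPU.
    dataset = tf.data.TFRecordDataset(filenames)          # filenames: your TFRecord paths
    dataset = dataset.map(parse_and_preprocess, num_parallel_calls=8)
    dataset = dataset.batch(32).prefetch(2)
    images, labels = dataset.make_one_shot_iterator().get_next()

# The model and training ops built from images and labels are placed on the GPU by default.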
The performance guide itself also mentions that by doing this, and by having an efficient input pipeline, you won't starve the CPU, the GPU, or both, which supports the reasoning above. In the same performance doc, you will also see the following:
If your training loop runs faster when using SSDs vs HDDs for storing your input data, you could be I/O bottlenecked. If this is the case, you should pre-process your input data, creating a few large TFRecord files.
This again points to the same CPU-memory-GPU performance bottleneck mentioned above.
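For illustration, writing preprocessed examples into a TFRecord file looks roughly like this (a sketch; preprocessed_examples is a hypothetical iterable of already-encoded image bytes and integer labels):

import tensorflow as tf

with tf.python_io.TFRecordWriter('train-00000-of-00001.tfrecord') as writer:
    for image_bytes, label in preprocessed_examples:  # hypothetical source of preprocessed data
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
        }))
        writer.write(example.SerializeToString())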
Hope this helps and in case you need more clarification (on CPU-GPU performance), do not hesitate to drop a message!
References:
[1] Somdip Dey, EFFICIENT DATA INPUT/OUTPUT (I/O) FOR FINITE DIFFERENCE TIME DOMAIN (FDTD) COMPUTATION ON GRAPHICS PROCESSING UNIT (GPU), 2014
[2] Tensorflow Performance Guide: https://www.tensorflow.org/performance/performance_guide
I will first quote the two arguments from the performance guide; I think your two questions concern these two arguments respectively.
The data is bounced back and forth between the CPU and GPU. ...
Another benefit is preprocessing on the CPU frees GPU time to focus on training.
(1) Operations like file readers, queues, and enqueue/dequeue can only run on the CPU, while operations like reshape, cast, and per_image_standardization can run on the CPU or the GPU. So a wild guess for your first question: if the code doesn't specify /cpu:0, the program will read files on the CPU, then preprocess the images on the GPU, and finally enqueue and dequeue on the CPU. (Not sure I am correct; waiting for an expert to verify. One way to check is sketched after point (2) below.)
(2) For the second argument, I have the same doubt. When you train a large network, most of the time is spent on the huge convolutions and the gradient computation, not on preprocessing images. However, when they report a 6x+ increase in samples/sec processed, I think they mean training on MNIST, where a small network is usually used, so that makes sense: smaller convolutions take much less time, so the time spent on preprocessing is relatively large, and a 6x+ increase is possible in that case. Still, "preprocessing on the CPU frees GPU time to focus on training" is a reasonable explanation.
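As mentioned in point (1), one way to check where ops actually run, rather than guessing, is to enable device placement logging (a sketch; train_op stands in for whatever op you actually run):

import tensorflow as tf

# Logs a line per op showing whether it was placed on /cpu:0 or /gpu:0.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(train_op)  # train_op: your existing training/input op (assumed)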
Hope this could help you.

cuda kernel not executing or returning an error

I have some CUDA code that runs FFTs and other math operations on blocks of 2^n elements, as requested by the user. The code works well when first run, but after running long enough it starts to fail. Eventually it gets to the point where, if I run any block size larger than 2^11, I get no data back (all zeros). I've done some testing by modifying the kernel code, and from what I can tell the kernel is not executing. I'm trying to figure out why my code stops producing data after multiple iterations on large block sizes.
The issue looks at first glance like a memory leak. I know I have to run multiple iterations of the processing to cause the error. At first only large block sizes stop working, but as I run more iterations, smaller block sizes start to fail as well. The reason I'm not certain the issue is memory is that my code works for block sizes below 2^11 regardless of how many iterations I run. If this were a simple memory leak, I would have expected the symptoms to get progressively worse until I couldn't access any memory on the card.
I've also noticed that larger block sizes (roughly equivalent to the amount of memory each thread uses) tend to cause the program to fail sooner. Increasing the number of blocks processed (i.e., the number of CUDA threads) doesn't seem to have an effect on when the code starts to fail.
As far as I can tell, no error code is being returned; the kernel doesn't appear to execute at all.
Can anyone suggest what may be causing this issue? I would settle for any insight into how to debug code running on the GPU or how to monitor GPU memory availability.
If you need more computation done, bump up your grid size, not your thread block size. To quote the CUDA Programming Guide 3.0, page 8: "On current GPUs, a thread block may contain up to 512 threads."
This means that blockDim.x * blockDim.y * blockDim.z <= 512 at all times. If you maintain that invariant, do things work?