TensorFlow Object Detection API GPU memory issue

I'm currently trying to train an object detection model based on one of the models from the detection model zoo. Running the setup on the CPU works as expected, but running the same on my GPU results in the following error.
2021-03-10 11:46:54.286051: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:46:54.751423: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764147: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-03-10 11:46:54.764233: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Monitoring the GPU in Task Manager, it seems that TensorFlow (as expected, as far as I understand it) tries to allocate the whole memory. Shortly after reaching a peak (roughly 7.3 GB of 8 GB), TF crashes with the error shown in the snippet above.
Solutions for this specific error elsewhere on the internet / Stack Overflow suggest that the problem can be solved by allowing dynamic memory growth. Doing this seems to work and TF manages to create at least one new checkpoint, but it eventually crashes with an error of a similar category, in that case a CUDA_OUT_OF_MEMORY error.
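The memory-growth workaround I tried looks roughly like this (a minimal sketch, run before any model is built):

import tensorflow as tf

# Let TF grow GPU memory on demand instead of grabbing (almost) all of it up front.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)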
System Information:
Ryzen 5
16 GB RAM
RTX 2060 Super with 8Gb VRAM
Training Setup:
Tensorflow 2.4
CUDA 11.0 (also tried several combinations of CUDA cuDNN versions)
cuDNN 8.0.4
Originally I wanted to use the pretrained EfficientDet D6 model, but I also tried several others like EfficientDet D4, CenterNet HourGlass 512x512 and SSD MobileNet V2 FPNLite. All of those models were started with different batch sizes, but even with a batch size of 1 the problem still occurs. The training images aren't large either (on average 600 x 800). Currently there are a total of 30 images, 15 per class, for training (I'm aware that the training data set should be bigger, but it's just to test the setup).
Now my question is whether anybody has an educated guess or another approach for finding the cause of this error, as I cannot imagine that my 2060 isn't capable of at least training an SSD with a batch size of 1 and rather small images. Could it be a hardware fault? If so, is there a way to check for that?

I've done a complete reinstallation of every component involved. I might have done something differently this time, but I cannot say what. At least I'm now able to use the GPU for training.

Related

Colab giving OOM for allocating 4.29 GB tensor on GPU in tensorflow

I am running a model which allocates a [32768, 32768] float weight (around 4.29 GB) in its first layer. But it gives an OOM error while adding the layer to the sequential model.
(Screenshots of the nvidia-smi output before adding the layer, the error message, and the nvidia-smi output after the error were attached as images.)
When the Colab GPU has about 13 GB of memory, why can't it allocate a 4.29 GB weight?
The other answers on this, e.g. allowing GPU memory growth, don't work.
(Note - the GPU and CPU code division in the model creation was originally meant to be on gpu1 and gpu2, but since Colab provides only one GPU, I divided it between CPU and GPU to use RAM from both)
IMHO you don't need to specify the GPU explicitly in TF/Keras; current versions on Colab will use it when it is available. GPU loading usually takes place at fit and predict time, not at model building, and you can then fine-tune memory consumption using the batch size.
Please try your code without the with tf.device(...) blocks.
And in the future, please copy and paste the actual code instead of posting pictures.
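For illustration, building the model without explicit device placement could look roughly like this (a sketch based only on the layer size mentioned in the question; everything else is assumed):

import tensorflow as tf

# No `with tf.device(...)` blocks: TensorFlow places the variables on the GPU
# automatically when one is visible, and memory use is then tuned at
# fit/predict time via the batch size.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32768, input_shape=(32768,)),  # ~4.29 GB of float32 weights
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')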

TensorFlow model serving on Google AI Platform online prediction too slow with instance batches

I'm trying to deploy a TensorFlow model to Google AI Platform for Online Prediction. I'm having latency and throughput issues.
The model runs on my machine in less than 1 second for a single image (with only an Intel Core i7-4790K CPU). I deployed it to AI Platform on a machine with 8 cores and an NVIDIA T4 GPU.
When running the model on AI Platform on the mentioned configuration, it takes a little less than a second when sending only one image. If I start sending many requests, each with one image, the model eventually blocks and stops responding. So I'm instead sending batches of images on each request (from 2 to 10, depending on external factors).
The problem is that I expected the batched requests to take almost constant time. When sending 1 image, the CPU utilization was around 10% and GPU 12%. So I expected that a batch of 9 images would use ~100% of the hardware and respond in roughly the same time (~1 sec), but this is not the case. A batch of 7 to 10 images takes anywhere from 15 to 50 seconds to process.
I have already tried to optimize my model. I was using map_fn and replaced that with manual loops, switched from float32 to float16, and tried to vectorize the operations as much as possible, but the situation is still the same.
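As a toy illustration of the kind of map_fn-to-vectorized rewrite I mean (not my actual postprocessing code):

import tensorflow as tf

boxes = tf.random.uniform([1000, 4])
# Per-element version using map_fn (slow).
areas_mapfn = tf.map_fn(lambda b: (b[2] - b[0]) * (b[3] - b[1]), boxes)
# Vectorized version of the same computation.
areas_vec = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])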
What am I missing here?
I'm using the latest AI Platform runtime for online prediction (Python 3.7, TensorFlow 2.1, CUDA 10.1).
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
Thanks in advance for any help with this. Please ask me for any information that you may need to better understand the issue.
UPDATE 2020-07-13: as suggested in a comment below, I also tried running the model on CPU, but it's really slow and suffers from the same problems as on GPU. It doesn't seem to process the images from a single request in parallel.
Also, I think I'm running into issues with TensorFlow Serving due to the rate and number of requests. I used the tensorflow/serving:latest-gpu Docker image locally to test this further. The model responds 3 times faster on my machine (GeForce GTX 1650) than on AI Platform, but its response times are really inconsistent. I'm getting the following response times (<number of images> <response time in milliseconds>):
3 9004
3 8051
11 4332
1 222
3 4386
3 3547
11 5101
9 3016
10 3122
11 3341
9 4039
11 3783
11 3294
Then, after running for a minute, I start getting delays and errors:
3 27578
3 28563
3 31867
3 18855
{
  message: 'Request failed with status code 504',
  response: {
    data: { error: 'Timed out waiting for notification' },
    status: 504
  }
}
For others with the same problem as me when using AI Platform:
As stated in a comment from the Google Cloud team here, AI Platform does not execute batches of instances at once. They plan on adding the feature, though.
We've since moved on from AI Platform to a custom deployment of NVIDIA's Triton Inference Server hosted on Google Cloud Compute Engine. We're getting much better performance than we expected, and we can still apply many more optimizations to our model provided by Triton.
Thanks to everyone who tried to help by replying to this answer.
From the Google Cloud documentation:
If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.
This has to do, like the quote says, with the difference in node allocation, especially with:
Node allocation for online prediction:
Keeps at least one node ready over a period of several minutes, to handle requests even when there are none to handle. The ready state ensures that the service can serve each prediction promptly.
You can learn more about that here
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
What postprocessing modifications have you made to YOLOv4? Is it possible that those operations are the source of the slowdown? One test you can do to validate this hypothesis locally is to benchmark an unmodified version of YOLOv4 against the benchmarks you've already made for your modified version.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
It would be interesting to take a look at the "debugging output" you're mentioning here. If you use https://www.tensorflow.org/guide/profiler#install_the_profiler_and_gpu_prerequisites, what is the breakdown of the most expensive operations? I've had some experience digging into TF ops and have found strange slowdowns caused by CPU <-> GPU data transfer bottlenecks in some cases. I'd be happy to hop on a call sometime and take a look with you if you shoot me a DM.
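If it helps, capturing a trace for a few inference steps could look roughly like this (a sketch assuming a TF 2.x runtime where the profiler API is available; the model and data here are toy stand-ins):

import tensorflow as tf

# Toy stand-ins so the sketch is self-contained; replace with the real model/data.
model = tf.keras.Sequential([tf.keras.layers.Dense(8, input_shape=(4,))])
dataset = tf.data.Dataset.from_tensor_slices(tf.random.normal([64, 4])).batch(8)

# Profile a handful of inference calls so the TensorBoard Profile tab can show
# the per-op breakdown (and any host <-> device copy stalls).
tf.profiler.experimental.start('logs/profile')  # arbitrary log directory
for batch in dataset.take(4):
    model(batch, training=False)
tf.profiler.experimental.stop()
# Then inspect with: tensorboard --logdir logs/profile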

TF Keras NAN Loss when using multiple GPUs

System:
Ubuntu 18.04 LTS
(2) NVIDIA GTX 1080Ti GPUs 11GB
Driver Version: 440.33.01
CUDA Version: 10.0
I am currently using Tensorflow 2.0 (Python) and the tf.keras library to train a CNN.
However, I am encountering an issue when I try to train my model by calling model.fit(). After I begin training, the loss is normal for 1-2 steps of the first epoch, but after that it suddenly becomes NaN. If I try to stop the kernel that is running the training script, the whole computer freezes.
This issue only happens when using multiple GPUs. The code I'm using works perfectly fine on a single GPU. I have wrapped all of my code inside the scope of a tf.distribute.MirroredStrategy using with strategy.scope():. I am feeding my network with data from a tf.data.Dataset (though this error occurs regardless of the data I'm using to train).
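For reference, the overall setup looks roughly like this (a minimal sketch with a toy model and random data, not my actual network):

import tensorflow as tf

# Distribute training across all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([256, 32]),
     tf.random.uniform([256], maxval=10, dtype=tf.int32))).batch(16)
model.fit(dataset, epochs=2)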
I then ran some tests:
1) I tried to replace the data in my dataset with random numbers from a distribution, but the loss still went to NaN.
2) I also tried feeding the numpy arrays directly to .fit(), but that didn't solve the issue.
3) I tried using different optimizers (Adam, RMSprop, SGD), batch sizes (4, 8, 16, 32), and learning rates, none of which helped to solve this problem.
4) I swapped out my network for a simple Multi-layer Perceptron, but the error persisted.
This doesn't appear to be an OOM issue, since the data is relatively small and running watch -n0.1 nvidia-smi reveals that memory usage never exceeds 30% on either of my GPUs. There doesn't seem to be any warning or error in the console output that might hint at the issue either.
Any help is appreciated

Non-Deterministic Results Using GPUs with TensorFlow and TensorFlow Serving... Why?

We have an object detection model developed in TensorFlow (1.10 and 1.13) that uses a standard CNN and some extra layers. We host the model in TensorFlow Serving 1.13.0 using the SavedModel format, on NVIDIA Tesla V100 GPUs with CUDA 10 and cuDNN 7.4.x. (We use the Google container images and/or Dockerfiles for TensorFlow Serving.)
We run unit tests to ensure that prediction results are what we expect. These all work great on CPU. But when we run them on the above GPU/CUDA/CUDNN configuration, we get differences in the prediction probabilities ranging from .001 to .0005.
Our goals are to understand:
why this happens?
is there anything we can do to prevent it?
If there is something we can do to prevent it, does that entail some sort of trade off, such as performance?
We have tried the following experiments:
Multiple runs of the same model on tensorflow GPU using checkpoint with batchsize of 1
results identical
Multiple runs of the same model on GPU using checkpoint with various batchsizes
results off by .001
Multiple runs of the same model on CPU using checkpoint with various batchsizes
results identical
Multiple runs of the same model on TensorFlow Serving GPU using checkpoint with batch size of 1
results identical
Comparing runs with checkpoint to runs with saved model on GPU
results off by .005
Compare runs with checkpoint to runs with savedmodel on CPU
results identical
Experimented with changing the batch_size and setting TF_CUDNN_USE_AUTOTUNE=0 on GPU
reduces max difference from .001 to .0005
Experimented with adding intra_op_parallelism_threads=1, inter_op_parallelism_threads=1 in combination with TF_CUDNN_USE_AUTOTUNE=0 (a sketch of this configuration follows this list)
results no different than the above
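A rough sketch of how those options were set (TF 1.x-style session configuration; the model loading code is omitted):

import os
import tensorflow as tf

# Disable cuDNN autotuning before TensorFlow initializes the GPU.
os.environ['TF_CUDNN_USE_AUTOTUNE'] = '0'

# Single-threaded op scheduling, as in the experiment above.
config = tf.ConfigProto(intra_op_parallelism_threads=1,
                        inter_op_parallelism_threads=1)
sess = tf.Session(config=config)
# ... restore the checkpoint / load the SavedModel into `sess` and run inference as usual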
IN SUMMARY: We have a few cases where the results of running inference on GPU are different:
Using a checkpoint versus a saved model.
Batchsize = 1 versus various batch sizes
Setting TF_CUDNN_USE_AUTOTUNE=0 reduces the difference when using various batch sizes
This happens with TF 1.10 AND 1.13.1
Again, our goals are to understand:
Why this happens?
Is there anything we can do to prevent it?
If there is something we can do to prevent it, does that entail some sort of trade off, such as performance?
I had some crazy nondeterministic behavior going on that didn't occur on my laptop's GPU but did happen on the server's GPUs.
Solution: Now I call cudaDeviceSynchronize() after every call to a cuBLAS, cuSOLVER, etc. function, and the nondeterministic issue disappeared! :) It drove me really crazy and angry, but apparently, because those libraries use streams, you can end up using the contents of a device pointer before the results have been completely written by those libraries' functions.

Tensorflow Object Detection API has slow inference time with tensorflow serving

I am unable to match the inference times reported by Google for models released in their model zoo. Specifically I am trying out their faster_rcnn_resnet101_coco model where the reported inference time is 106ms on a Titan X GPU.
My serving system is using TF 1.4 running in a container built from the Dockerfile released by Google. My client is modeled after the inception client also released by Google.
I am running on Ubuntu 14.04 with TF 1.4 and 1 Titan X. My total inference time is 3x worse than reported by Google, ~330ms. Making the tensor proto takes ~150ms and Predict takes ~180ms. My saved_model.pb is taken directly from the tar file downloaded from the model zoo. Is there something I am missing? What steps can I take to reduce the inference time?
I was able to solve the two problems by
optimizing the compiler flags. I added the following to the bazel build command: --config=opt --copt=-msse4.1 --copt=-msse4.2 --copt=-mavx --copt=-mavx2 --copt=-mfma
Not importing tf.contrib for every inference. In the inception_client sample provided by Google, these lines re-import tf.contrib on every forward pass (see the sketch below).
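Roughly, the idea is to resolve the tf.contrib helper once, outside the per-request path (a sketch, not the exact client code; the function name is hypothetical):

# Import the tf.contrib helper once, at module load time,
# instead of triggering the import inside the per-request code path.
from tensorflow.contrib.util import make_tensor_proto

def build_request_tensor(image_bytes):
    # Reused for every request; no tf.contrib import happens per call.
    return make_tensor_proto(image_bytes, shape=[1])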
Non-max suppression may be the bottleneck: https://github.com/tensorflow/models/issues/2710.
Is the image size 600x600?
I ran a similar model with a Titan Xp; however, I used the infer_detections.py script and logged the forward pass time, basically by recording the time with datetime before and after the following call:
tf_example = detection_inference.infer_detections_and_add_to_example(
    serialized_example_tensor, detected_boxes_tensor,
    detected_scores_tensor, detected_labels_tensor,
    FLAGS.discard_image_pixels)
I had reduced the number of proposals generated in the first stage of Faster R-CNN from 300 to 100, and reduced the number of detections in the second stage to 100 as well. I got numbers in the range of 80 to 140 ms, and I think that a 600x600 image would take approximately ~106 ms or slightly less in this setup (due to the Titan Xp and the reduced complexity of the model).
Maybe you can repeat the above process on your hardware; that way, if the numbers are also ~106 ms for this case, we can attribute the difference to the use of the Dockerfile and the client. If the numbers are still high, then perhaps it is the hardware.
It would be helpful if someone from the TensorFlow Object Detection team could comment on the setup used for generating the numbers in the model zoo.
@Vikram Gupta did you check your GPU usage? Does it get somewhere near 80-100%? I experience very low GPU usage when detecting objects in a video stream with the API and models from the model zoo.