Google Colab - Session Crashed error while training - tensorflow

I am trying to train a model on around 4500 text sentences. The embedding is however heavy. The session crashes for number of training clauses > 350. It works fine and displays results till 350 sentences though.
Error message:
Your session crashed after using all available RAM.
Runtime type - GPU
I have hereby attached the screenshot of logs.
I am considering training model by batches, but I am a newbie and finding it difficult to have my way around it. Any help will be appreciated. session logs attached

This is basically because of out of memory on Google colab.
Google colab provides ~12GB Free RAM, to extend it 25 GB follow the instruction mentioned here.

Related

Long waiting when running training model with ML

I have trouble for long waiting when I run my training model with Machine Learning using CNNs. Maybe this because my pc has such a bad specs for machine learning.
I have 50000 images for my X_training and must wait up to 1 hours more until it's done.
I think maybe that someone can solve my problem. Thanks a lot
I would recommend you to use Google Collab. It’s free to use. You can access it withing Google Drive and make sure to change the runtime to GPU. In cases such as CNN, using GPUs can make your training process a lot faster.
Also, I don’t know how you are handling images, but if using TensorFlow/Keras I would also recommend you to use the ImageDataGenerator for not loading all images into memory at once, but loading the images needed within each batch. It can save some resources for the computer

TensorFlow model serving on Google AI Platform online prediction too slow with instance batches

I'm trying to deploy a TensorFlow model to Google AI Platform for Online Prediction. I'm having latency and throughput issues.
The model runs on my machine in less than 1 second (with only an Intel Core I7 4790K CPU) for a single image. I deployed it to AI Platform on a machine with 8 cores and an NVIDIA T4 GPU.
When running the model on AI Platform on the mentioned configuration, it takes a little less than a second when sending only one image. If I start sending many requests, each with one image, the model eventually blocks and stops responding. So I'm instead sending batches of images on each request (from 2 to 10, depending on external factors).
The problem is that I expected the batched requests to be almost constant in time. When sending 1 image, the CPU utilization was around 10% and GPU 12%. So I expected that a batch of 9 images would use ~100% of the hardware and respond in the same time ~1 sec, but this is not the case. A batch of 7 to 10 images takes anywhere from 15 to 50 seconds to be processed.
I already tried to optimize my model. I was using map_fn, replaced that with manual loops, switched from Float 32 to Float 16, tried to vectorize the operations as much as possible, but it's still in the same situation.
What am I missing here?
I'm using the latest AI Platform runtime for online prediction (Python 3.7, TensorFlow 2.1, CUDA 10.1).
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operates on the output of the model.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
Thanks in advance for any help with this. Please ask me for any information that you may need to better understand the issue.
UPDATE 2020-07-13: as suggested in a comment below, I also tried running the model on CPU, but it's really slow and suffers of the same problems than with GPU. It doesn't seem to process images from a single request in parallel.
Also, I think I'm running into issues with TensorFlow Serving due to the rate and amount of requests. I used the tensorflow/serving:latest-gpu Docker image locally to test this further. The model answers 3 times faster on my machine (GeForce GTX 1650) than on AI Platform, but its really inconsistent with response times. I'm getting the following response times (<amount of images> <response time in milliseconds>):
3 9004
3 8051
11 4332
1 222
3 4386
3 3547
11 5101
9 3016
10 3122
11 3341
9 4039
11 3783
11 3294
Then, after running for a minute, I start getting delays and errors:
3 27578
3 28563
3 31867
3 18855
{
message: 'Request failed with status code 504',
response: {
data: { error: 'Timed out waiting for notification' },
status: 504
}
}
For others with the same problem as me when using AI Platform:
As stated in a comment from the Google Cloud team here, AI Platform does not execute batches of instances at once. They plan on adding the feature, though.
We've since moved on from AI Platform to a custom deployment of NVIDIA's Triton Inference Server hosted on Google Cloud Compute Engine. We're getting much better performance than we expected, and we can still apply many more optimizations to our model provided by Triton.
Thanks to everyone who tried to help by replying to this answer.
From the Google Cloud documentation:
If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.
This has to do, like the quote says, with the difference in node allocations, specially with:
Node allocation for online prediction:
Keeps at least one node ready over a period of several minutes, to handle requests even when there are none to handle. The ready state ensures that the service can serve each prediction promptly.
You can learn more about that here
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operates on the output of the model.
What are the postprocessing modifications have you made to the YOLOv4? Is it possible that the source of the slowdown are from those operations? One test you can do to validate this hypothesis locally is to benchmark an unmodified version of YOLOv4 against the benchmarks you've already made for your modified version.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
It would be interesting to take a look at the "debugging output" you're mentioning here. If you use https://www.tensorflow.org/guide/profiler#install_the_profiler_and_gpu_prerequisites, what are the breakdown of the most expensive operations? I've had some experience digging into TF ops -- I've found some strange bottlenecks due to CPU <-> GPU data transfer bottlenecks in some cases. Would be happy to hop on a call sometime and take a look with you if you shoot me a DM.

machine learning and model training

I am working on a machine learning project where I am my training my model on Google Colab.
I have cloned the repository and model is build up with tensor flow framework.
However, my data-set is too large. Before running the model I have two questions which are coming to mind:
1) If I leave my model overnight to get trained, what is the smartest way to know that my training is completed/left in between? (Any notification through email . . or ?)
2) What happens, if the internet connection breaks in between
My Google search is not providing me understandable answer. I would appreciate any help with solutions for my queries.
Maximum of 2 instances can be run concurrently and are linked to your Google account. Keep backing up your weights, and re-train if it takes more than 12 hours.
For such long jobs, it's always better to invest in a VPS, but to answer your questions,
The maximum lifetime of a job on Colab with the browser open is 12 hours. Therefore, it's a good idea to periodically save your model weights. A script to backup weights while training is a good idea.
If the internet connection breaks, the notebook will run for 90 minutes before the instance is considered to be idle and will be recycled. It's similar to closing your browser.

Training segmentation model, 4 GPUs are working, 1 fills and getting: "CUDA error: out of memory"

I'm trying to build a segmentation model,and I keep getting
"CUDA error: out of memory",after ivestigating, I realized that all the 4 GPUs are working but one of them is filling.
Some technical details:
My Model:
the model is written in pytorch and has 3.8M parameters.
My Hardware:
I have 4 GPUs with 12GRAM (Titan V) each.
I'm trying to understand why one of my GPUs is filling up, and what am I doing wrong.
Evidence:
as can be seen from the screenshot below, all the GPUs are working, but one of them just keep filling until he gets his limit.
Code:
I'll try to explain what I did in the code:
First my model:
model = model.cuda()
model = nn.DataParallel(model, device_ids=None)
Second, Inputs and targets:
inputs = inputs.to('cuda')
masks = masks.to('cuda')
Those are the lines that working with the GPUs, if I missed something, and you need anything else, please share.
I'm feeling like I'm missing something so basic, that will affect not only this model but also the models in the future, I'll be more than happy for some help.
Thanks a lot!
Without knowing much of the details I can say the following
nvidia-smi is not the most reliable and up-to-date measurement mechanism
the PyTorch GPU allocator does not help either - it will cache blocks of memory artificially blowing up used resources (though it is not an issue here)
I believe there is still a "master" GPU which is the one data is loaded to directly (and then broadcast to other GPUs in DataParallel)
I don't know enough about PyTorch to reliably answer, but you can definitely check if a single GPU setup works with batch size divided by 4. And perhaps if you can load the model + the batch at one (without processing it).

How can I keep the gpu running for more than 12 hours in google colaboratory?

I am trying to train a model on google colaboratoy but the problem is that the GPU gets disconnected after 12 hours and hence I am not able to train my model beyond a certain point. Is there a way to keep GPU connected for longer times?