Google cloud-ml stuck without logs after several iterations - tensorflow

I am training a TensorFlow ML job on cloud-ml, and the job seems to get stuck after a few iterations (around 900). Surprisingly, when I run the code locally it works fine, and hyperparameter tuning on GCP also keeps training, but it runs slower than my local laptop, which has a GTX 1060 GPU.
I am also using runtime version 1.6.
I changed the scale tier and it didn't help. What could be the issue?

Related

TensorFlow model serving on Google AI Platform online prediction too slow with instance batches

I'm trying to deploy a TensorFlow model to Google AI Platform for Online Prediction. I'm having latency and throughput issues.
The model runs on my machine in less than 1 second for a single image (with only an Intel Core i7-4790K CPU). I deployed it to AI Platform on a machine with 8 cores and an NVIDIA T4 GPU.
When running the model on AI Platform on the mentioned configuration, it takes a little less than a second when sending only one image. If I start sending many requests, each with one image, the model eventually blocks and stops responding. So I'm instead sending batches of images on each request (from 2 to 10, depending on external factors).
The problem is that I expected the batched requests to take almost constant time. When sending 1 image, CPU utilization was around 10% and GPU around 12%, so I expected that a batch of 9 images would use ~100% of the hardware and respond in roughly the same ~1 sec, but this is not the case. A batch of 7 to 10 images takes anywhere from 15 to 50 seconds to be processed.
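For context, a batched request like the ones described above amounts to a single predict call with several entries under instances. Below is a minimal sketch using the googleapiclient library; the project name, model name, input key and the use of NumPy arrays are placeholders, not the actual client code:

# Minimal sketch of a batched online prediction request to AI Platform.
# "PROJECT", "MODEL" and the input key "image" are placeholders, and the
# images are assumed to be NumPy arrays.
from googleapiclient import discovery

def predict_batch(images):
    """Send one request containing several instances (one per image)."""
    service = discovery.build("ml", "v1")
    name = "projects/PROJECT/models/MODEL"
    body = {"instances": [{"image": img.tolist()} for img in images]}
    response = service.projects().predict(name=name, body=body).execute()
    if "error" in response:
        raise RuntimeError(response["error"])
    return response["predictions"]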
I have already tried to optimize my model: I was using map_fn and replaced it with manual loops, switched from float32 to float16, and tried to vectorize the operations as much as possible, but the situation is still the same.
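To illustrate the kind of rewrite mentioned above (not the actual post-processing code), a per-image tf.map_fn can often be replaced by a single batched op, which usually parallelizes much better on GPU:

import tensorflow as tf

# Dummy batch standing in for the real images.
images = tf.random.uniform([8, 416, 416, 3])

# Per-image processing with map_fn: one call per element of the batch.
scaled_mapped = tf.map_fn(lambda img: img * 2.0 - 1.0, images)

# Equivalent vectorized form: a single op over the whole batch.
scaled_vectorized = images * 2.0 - 1.0

# Both produce the same result.
tf.debugging.assert_near(scaled_mapped, scaled_vectorized)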
What am I missing here?
I'm using the latest AI Platform runtime for online prediction (Python 3.7, TensorFlow 2.1, CUDA 10.1).
The model is a large version of YOLOv4 (~250 MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
Thanks in advance for any help with this. Please ask me for any information that you may need to better understand the issue.
UPDATE 2020-07-13: as suggested in a comment below, I also tried running the model on CPU, but it's really slow and suffers from the same problems as on GPU. It doesn't seem to process images from a single request in parallel.
Also, I think I'm running into issues with TensorFlow Serving due to the rate and number of requests. I used the tensorflow/serving:latest-gpu Docker image locally to test this further. The model answers 3 times faster on my machine (GeForce GTX 1650) than on AI Platform, but its response times are really inconsistent. I'm getting the following response times (<number of images> <response time in milliseconds>); a sketch of the kind of client used for this local test appears after the error output below:
3 9004
3 8051
11 4332
1 222
3 4386
3 3547
11 5101
9 3016
10 3122
11 3341
9 4039
11 3783
11 3294
Then, after running for a minute, I start getting delays and errors:
3 27578
3 28563
3 31867
3 18855
{
  message: 'Request failed with status code 504',
  response: {
    data: { error: 'Timed out waiting for notification' },
    status: 504
  }
}
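For anyone reproducing this local test against the tensorflow/serving:latest-gpu container: a minimal Python client along these lines can produce timings in the format above. The model name "yolo", the port 8501 and the 416x416 input shape are placeholders, not the setup used in this question:

# Rough sketch of a client that times batched requests against a local
# TensorFlow Serving REST endpoint. Model name, port and shape are placeholders.
import time
import numpy as np
import requests

URL = "http://localhost:8501/v1/models/yolo:predict"

def time_request(batch_size):
    batch = np.random.rand(batch_size, 416, 416, 3).tolist()
    start = time.time()
    resp = requests.post(URL, json={"instances": batch}, timeout=120)
    resp.raise_for_status()
    elapsed_ms = int((time.time() - start) * 1000)
    print(batch_size, elapsed_ms)

for n in (1, 3, 9, 11):
    time_request(n)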
For others with the same problem as me when using AI Platform:
As stated in a comment from the Google Cloud team here, AI Platform does not execute batches of instances at once. They plan on adding the feature, though.
We've since moved on from AI Platform to a custom deployment of NVIDIA's Triton Inference Server hosted on Google Cloud Compute Engine. We're getting much better performance than we expected, and Triton offers many more optimizations we can still apply to our model.
Thanks to everyone who tried to help by replying to this answer.
From the Google Cloud documentation:
If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.
This has to do, as the quote says, with the difference in node allocation, especially with:
Node allocation for online prediction:
Keeps at least one node ready over a period of several minutes, to handle requests even when there are none to handle. The ready state ensures that the service can serve each prediction promptly.
You can learn more about that here
The model is a large version of YOLOv4 (~250 MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
What postprocessing modifications have you made to YOLOv4? Is it possible that those operations are the source of the slowdown? One test you can do to validate this hypothesis locally is to benchmark an unmodified version of YOLOv4 against the measurements you've already made for your modified version.
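One concrete way to run that comparison is to time raw SavedModel inference for both versions side by side. This is only a sketch: the SavedModel directories, the input name "input_1" and the 416x416 shape are placeholders that need to be replaced with the real ones (check infer.structured_input_signature):

import time
import tensorflow as tf

def benchmark(saved_model_dir, batch_size=1, runs=20):
    """Average per-call latency for a SavedModel's serving signature."""
    model = tf.saved_model.load(saved_model_dir)
    infer = model.signatures["serving_default"]
    images = tf.random.uniform([batch_size, 416, 416, 3])
    infer(input_1=images)  # warm-up call, excludes one-time setup cost
    start = time.time()
    for _ in range(runs):
        infer(input_1=images)
    return (time.time() - start) / runs

print("stock YOLOv4:       ", benchmark("yolov4_stock"))
print("with postprocessing:", benchmark("yolov4_postprocessed"))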
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
It would be interesting to take a look at the "debugging output" you're mentioning here. If you use https://www.tensorflow.org/guide/profiler#install_the_profiler_and_gpu_prerequisites, what is the breakdown of the most expensive operations? I've had some experience digging into TF ops, and in some cases I've found strange slowdowns caused by CPU <-> GPU data transfers. I'd be happy to hop on a call sometime and take a look with you if you shoot me a DM.
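For reference, a minimal sketch of the programmatic profiling API from that guide is below. It assumes TF 2.2 or newer (the question mentions 2.1, where the TensorBoard callback with profile_batch is the alternative), and the matmul loop is just a dummy workload standing in for the real inference call:

import tensorflow as tf

# Dummy workload standing in for the real inference call.
x = tf.random.uniform([1024, 1024])

# Capture a short trace; it can then be opened in TensorBoard's Profile tab
# to see which ops dominate and whether they run on CPU or GPU.
tf.profiler.experimental.start("profile_logs")
for _ in range(20):
    x = tf.matmul(x, x)
tf.profiler.experimental.stop()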

Keras showing multiple processes for one training

Whenever I launch a training with Keras, I end up with multiple processes (dozens), as you can see in this htop screenshot. Is that normal?
Could that be the reason why I am experiencing memory issues? The cache fills up as the training goes on, then the swap is activated, and after some hours the machine needs to be restarted.
The training is done on a single GPU, using fit_generator:
training_model.fit_generator(
    generator=train_generator,
    steps_per_epoch=config["steps"],
    epochs=config["epochs"],
    verbose=1,
    callbacks=callbacks
)
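If the extra processes come from the data-loading workers, it may be worth experimenting with the workers and use_multiprocessing arguments of fit_generator. The sketch below reuses the names from the snippet above, and the values shown are only a starting point, not the asker's actual configuration:

# Variant of the call above; the generator arguments control how many helper
# workers Keras spawns for data loading (values are illustrative).
training_model.fit_generator(
    generator=train_generator,
    steps_per_epoch=config["steps"],
    epochs=config["epochs"],
    verbose=1,
    callbacks=callbacks,
    workers=1,                  # number of data-loading workers
    use_multiprocessing=False,  # threads instead of separate processes
    max_queue_size=10           # cap on prefetched batches held in memory
)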
Keras 2.2.4
tensorflow-gpu 1.13.1
CUDA 10.0
Thanks for your help!
From the information you have provided, this seems to be a case of running out of memory.
Please confirm whether you are using the GPU. You can check by running:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Also, if the GPU is being used, its processes can be monitored with nvtop rather than htop.
The best way to identify which operation in your project is consuming the most memory and time is to use the TensorFlow Profiler.
Consider upgrading your TensorFlow version to 1.15 or 2.2 if possible; there, model.fit also does the job of model.fit_generator.
Code to use the Profiler is shown below:
# Create a TensorBoard callback that profiles batches 500 to 520
from datetime import datetime
import tensorflow as tf

logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                 histogram_freq=1,
                                                 profile_batch='500,520')

model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks=[tboard_callback])
High-level details of the execution can be viewed on the Overview page of the Profiler, available in the Profile tab of TensorBoard.
From the Overview page in my run, for example, it can be seen that input processing was taking most of the execution time (83%).
It also provides recommendations to reduce memory consumption and thus optimize execution.
Other options in the Tools dropdown, such as the Trace Viewer, provide useful details such as execution information at the level of individual operations.
For more information, please refer to the Profiler tutorial, the Profiler guide, and the YouTube video.
Hope this helps. Happy Learning!

Tensorflow fails to run on GPU from time to time

Solved this problem myself. It was because there were too many images in the CelebA dataset and my data loader was very inefficient. The data loading took too much time and caused the low speed.
Still, this does not explain why the code was running on the CPU while GPU memory was also taken up. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests they might have been running on the CPU).
I later tried the CelebA dataset, changing only the folder name used to load the data (I load all the data into memory at once, then use my own next_batch function and feed_dict to train the model). Then the problem arose: GPU memory was still taken up according to GPU-Z, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
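Given the self-diagnosis above (slow, memory-hungry feed_dict-based loading), a common fix in TF 1.x is a tf.data pipeline that decodes and batches images on the fly. The following is only a sketch; the glob pattern, the 128x128 target size and the batch size are placeholders, not the code used in this question:

import glob
import tensorflow as tf

# Sketch of a tf.data input pipeline (TF 1.x style).
def parse_image(path):
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [128, 128])
    return image / 127.5 - 1.0  # scale to [-1, 1]

paths = glob.glob("celeba/*.jpg")
dataset = (tf.data.Dataset.from_tensor_slices(paths)
           .shuffle(buffer_size=1000)
           .map(parse_image, num_parallel_calls=4)
           .batch(16)
           .prefetch(2))  # overlap data loading with GPU compute

next_images = dataset.make_one_shot_iterator().get_next()
# sess.run(next_images) then replaces the feed_dict-based next_batch function.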
What batch size were you trying? If it's too low (something like 2 to 8) for a small model, not much memory will be consumed. It all depends on your batch size, the number of parameters in your model, and so on, as well as on the model architecture and how much of it can run in parallel. Maybe try increasing your batch size and re-running it?

object detection Training becomes slower in time. Uses more CPU than GPU as the training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just a VGG-16 implementation for Faster R-CNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA GTX 1060, 6 GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
Training params are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
The training runs fine, but as time progresses it becomes slower and keeps shifting between CPU and GPU.
The steps that take around 1 sec are running on the GPU, and the ones with high times like ~20 secs are running on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but I don't understand why it happens here.
PS: I am running only the train script, not the eval script.
Update:
This becomes worse over time.
The secs/step keeps increasing, which affects the rate at which the checkpoints and logs get stored.
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started the training. And my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are simply too large. If you take a look at TensorBoard during the training session, you can find that all the reshape calculations are running on the GPU. So maybe you can write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a little higher (maybe 4 or 8). Then you can train on your dataset faster and still get relatively good results (mAP).
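A sketch of that kind of offline resize step is below (using Pillow). The folder names and the 600 px maximum side are placeholders; adjust them to whatever your detection config expects:

# Resize images so the longer side is at most MAX_SIDE pixels,
# preserving the aspect ratio. Folder names and size are placeholders.
import os
from PIL import Image

SRC, DST, MAX_SIDE = "images_raw", "images_resized", 600

os.makedirs(DST, exist_ok=True)
for name in os.listdir(SRC):
    img = Image.open(os.path.join(SRC, name))
    scale = MAX_SIDE / max(img.size)
    if scale < 1.0:  # only shrink, never enlarge
        new_size = (int(img.width * scale), int(img.height * scale))
        img = img.resize(new_size, Image.BILINEAR)
    img.save(os.path.join(DST, name))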

tensorflow training is slow, stops running on gpu?

I'm running the cifar_train program from TensorFlow's tutorial. The tutorial says it reaches a peak performance of 86% after a few hours. I tried running this on my laptop, and after about 4 days it was still training. I'm wondering if I'm misreading the tutorial by thinking that it'll be finished after a few hours, or if that is just the accuracy reached by that time. Also, I'm watching my GPU usage via Afterburner: when I started, it was around 30-40%, but after looking at it overnight the GPU usage is 0% and the CPU is at 100%. Is this normal?
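If it helps to narrow this down, one quick check in TF 1.x is to enable device placement logging, which prints whether each op is assigned to the CPU or the GPU. This is only a generic sketch with a dummy matmul, not the cifar_train code itself:

import tensorflow as tf

# With log_device_placement enabled, TF 1.x logs the device (CPU or GPU)
# chosen for every op when the graph is run.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.random_uniform([1000, 1000])
    b = tf.random_uniform([1000, 1000])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))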