Solved this problem myself. It was because there were too many images in the CelebA dataset and my data loader was very inefficient. The data loading took too much time and caused the slow speed.
But this still does not explain why the code was running on the CPU while the GPU memory was also taken up. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests it might have been running on the CPU).
I later tried the CelebA dataset, changing only the folder name used to load the data (I load all the data into memory at once, then use my own next_batch function and feed_dict to train the model). Then the problem arose: the GPU memory was still taken according to GPU-Z, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
What batch size were you trying? If it's too low (something like 2-8) for a small model, the memory consumed will not be much. It all depends on your batch size, the number of parameters in your model, etc. It also depends on the model architecture and how much of the model can run in parallel. Maybe try increasing your batch size and re-running it?
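As a rough illustration of how little a small batch occupies (assuming float32 inputs of 256x256x3, an arbitrary example size, and counting only the input tensor, not weights or activations):

bytes_per_image = 256 * 256 * 3 * 4          # one float32 image, ~0.75 MB
print(bytes_per_image * 8 / 2**20, "MB")     # batch of 8   -> ~6 MB
print(bytes_per_image * 256 / 2**20, "MB")   # batch of 256 -> ~192 MB

So a small batch barely registers in GPU memory or utilization, even if the allocator still reserves most of the card.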
Related
I am training a model using the TF Keras API. The issue I am having is that I am unable to maximise GPU usage; the GPU is under-utilised in both memory and processing.
When profiling the model, I can see a lot of operations labelled _Send, which I assume is data hopping between the GPU and the CPU.
Since I am using Keras, I am not directly placing variables on a device, so I am not clear on why this is occurring or how to optimise it.
Another interesting side effect is that larger batches make training slower, with long waits for the GPU to get data from the CPU.
The profiler also suggests:
59.4 % of the total step time sampled is spent on 'Kernel Launch'. It could be due to CPU contention with tf.data. In this case, you may try to set the environment variable TF_GPU_THREAD_MODE=gpu_private.
I have set this env var at the top of the notebook, with no effect. I am not clear on how to check whether it is having the intended effect.
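For what it's worth, TF_GPU_THREAD_MODE is typically only picked up if it is set before TensorFlow initializes its GPU devices, so setting it after import may have no effect. A minimal sketch, assuming a fresh notebook kernel:

import os
os.environ['TF_GPU_THREAD_MODE'] = 'gpu_private'   # must be set before TF touches the GPU
os.environ['TF_GPU_THREAD_COUNT'] = '2'            # optional, from the same performance guidance

import tensorflow as tf                            # import only after the variables are set
print(os.environ.get('TF_GPU_THREAD_MODE'))        # confirms the variable is visible to this process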
Your help here would be greatly appreciated, I have read all the available guides on the tensorflow docs.
Do you know any elegant way to do inference in 2 Python processes sharing 1 GPU with TensorFlow?
Suppose I have 2 processes: the first one classifies cats/dogs, the second one classifies birds/planes. Each process runs a different TensorFlow model on the GPU, and both models are continuously fed images from different cameras.
Usually, TensorFlow will occupy the memory of the entire GPU, so when you start another process, it crashes with OUT OF MEMORY, a failed CUDA convolution, or something along those lines.
Is there a tutorial/article/sample code that shows how to load 2 models in different processes and run both in parallel?
This is also very useful if you are running model inference while doing some heavy graphics work, e.g. playing games. I also want to know how running the model affects the game.
I've tried using Python Thread and it works, but each model predicts 2 times slower (and, as you know, Python threads do not utilise multiple CPU cores). I want to use Python Process, but it's not working. If you have a few lines of sample code that work, I would appreciate that very much.
I've also attached my current Thread code:
As summarized here, you can specify the proportion of GPU memory allocated per process.
import tensorflow as tf

# Cap this process at roughly one third of the GPU's memory
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.333)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
Using Keras, it may be simpler to allow 'memory growth', which expands the allocated memory on demand, as described here.
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Allocate GPU memory on demand instead of reserving it all up front
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized
        print(e)
The following should work for Tensorflow 2.0:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2   # cap this process at 20% of GPU memory
config.gpu_options.allow_growth = True                     # and allocate it only as needed
session = InteractiveSession(config=config)
Apart from setting the GPU memory fraction, you need to enable MPS in CUDA to get better speed if you are running more than one model on the GPU simultaneously. Otherwise, inference will be slower compared to a single model running on the GPU.
sudo nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
sudo nvidia-cuda-mps-control -d
Here, 0 is your GPU number.
After finishing, stop the MPS daemon:
echo quit | sudo nvidia-cuda-mps-control
OK. I think I've found the solution now.
I use TensorFlow 2, and there are essentially 2 methods to manage GPU memory usage:
set memory growth to true
set memory limit to some number
Both methods work; ignore all the warning messages about running out of memory. I still don't know exactly what they mean, but the model keeps running, and that's what I care about.
I measured the exact time the model takes to run, and it's a lot better than on the CPU. If I run both processes at the same time, the speed drops a bit, but it's still a lot better than running on the CPU.
With the memory growth approach, my GPU has 3 GB, so the first process tries to allocate everything and the second process then reports out of memory. But it still works.
With the memory limit approach, I set the limit to some number, e.g. 1024 MB. Both processes work.
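For reference, a minimal sketch of both approaches in TF 2 (the single GPU and the 1024 MB limit are just the values from my setup; the configuration calls must run before the GPU is first used):

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Option 1: memory growth - grab GPU memory only as it is needed
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (use instead of option 1): hard memory limit, e.g. 1024 MB
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])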
So what is the right minimum number to set?
I kept reducing the memory limit and found that my model works fine with a 64 MB limit. The prediction speed is the same as when I set the limit to 1024 MB. When I set the limit to 32 MB, I noticed a 50% speed drop, and at 16 MB the model refuses to run because it does not have enough memory to store the image tensor.
This means my model requires a minimum of 64 MB, which is very little considering that I have 3 GB to spare. It also allows me to run the model while playing video games.
Conclusion: I chose the memory limit approach with a 64 MB limit. You can check how to set a memory limit here: https://www.tensorflow.org/guide/gpu
I suggest you try changing the memory limit to find the minimum your model needs. You will see a speed drop, or the model will refuse to run, when the memory is not enough.
I am running TensorFlow on a machine which has two GPUs, each with 3 GB of memory. My batch size is only 2 GB, and so it can fit on one GPU. Is there any point in training with both GPUs (using CUDA_VISIBLE_DEVICES)? If I did, how would TensorFlow distribute the training?
With regards to memory: I assume you mean that one data batch is 2 GB. However, TensorFlow also requires memory to store variables as well as hidden layer results, etc. (to compute gradients). For this reason, whether the memory is enough also depends on your specific model. Your best bet would be to just try with one GPU and see if the program crashes due to memory errors.
With regards to distribution: TensorFlow doesn't do this automatically at all. Each op is placed on some device. By default, if you have any number of GPUs available, all GPU-compatible ops are placed on the first GPU and the rest on the CPU. This is despite TensorFlow reserving all memory on all GPUs by default.
You should have a look at the GPU guide on the TensorFlow website. The most important thing is that you can use the with tf.device context manager to place ops on other GPUs. Using this, the idea would be to split your batch into X chunks (X = number of GPUs) and define your model on each device, each time taking the respective chunk as input and making sure to reuse variables, as in the sketch below.
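A rough sketch of that manual splitting idea in TF 1.x (the model_fn and the two-GPU split are illustrative assumptions, not code for any particular model):

import tensorflow as tf

def model_fn(x):
    # stand-in for your real model; AUTO_REUSE shares the variables across towers
    return tf.layers.dense(x, 10, name='out', reuse=tf.AUTO_REUSE)

batch = tf.placeholder(tf.float32, [None, 128])
chunks = tf.split(batch, num_or_size_splits=2, axis=0)   # one chunk per GPU

outputs = []
for i, chunk in enumerate(chunks):
    with tf.device('/gpu:%d' % i):                       # place this tower on GPU i
        outputs.append(model_fn(chunk))

logits = tf.concat(outputs, axis=0)                      # merge the per-GPU results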
If you are using tf.Estimator, there is some information in this question. It is very easy to do distributed execution there using just two simple wrappers, but I personally haven't been able to use it successfully (it's pretty slow and crashes randomly with a segfault).
What I'm trying to do is retrain VGG16 to recognize new types of image data, using Keras with the TensorFlow backend.
But the training process seems very slow to me, and after checking my GPU's performance in the task manager, it looks like the GPU is barely being utilized.
This is my code: https://hastebin.com/pepozayutu.py
This is the output in my console: https://hastebin.com/uhonugenej.md
And this is what my task manager looks like during training: https://imgur.com/a/jRJ66
As you can see, the GPU is barely doing anything, so why is my training so slow? It's agonizing to try different setups because each training run takes 20-60 minutes depending on the number of epochs.
I have installed tensorflow-gpu 1.7.0, cuDNN 7.0.5, CUDA 9.0 and Keras 2.1.5. I'm running an NVIDIA GeForce 940MX.
edit: I solved it! It turns out that my GPU was only being used for very short periods, because the bottleneck was actually loading the images. I stored my images as 3000x4000 pixel JPEGs even though I scale them down to 150x150 (or sometimes 64x64) for the CNN anyway. Reducing the size of the images on disk got rid of the bottleneck.
Increase your batch size to, say, 512. A batch size of 30 is too small.
The new tf.data.Dataset API is great for building pipelines that feed data to your model asynchronously: while your graph is being computed, the pipeline prefetches the data so it is ready when the next graph execution cycle starts. With tf.estimators you can also run a Keras model with a tf.data.Dataset pipeline. An example:
https://www.dlology.com/blog/an-easy-guide-to-build-new-tensorflow-datasets-and-estimator-with-keras-model/
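As a rough illustration of such a pipeline (the file pattern, image size and batch size are placeholders, not taken from the question):

import tensorflow as tf

def parse(path):
    image = tf.image.decode_jpeg(tf.read_file(path), channels=3)
    image = tf.image.resize_images(image, [150, 150])    # resize on the fly
    return tf.cast(image, tf.float32) / 255.0

dataset = (tf.data.Dataset.list_files('train/*.jpg')
           .map(parse, num_parallel_calls=4)             # decode in parallel on the CPU
           .batch(32)
           .prefetch(1))                                 # overlap loading with training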
System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just a VGG-16 implementation for Faster R-CNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA 1060, 6 GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
Training params are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
The training runs fine, but as time progresses it becomes slower. It keeps shifting between the CPU and the GPU.
The steps that take around 1 sec are running on the GPU, while the much slower ones (~20 secs) are running on the CPU; I cross-verified this using 'top' and 'nvidia-smi'. Why is it shifting between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but otherwise I don't understand why.
PS: I am running only the train script; I am not running the eval script.
Update:
This becomes worse over time.
the secs/step is increasing, which affects the rate at which the checkpoints and logs get stored
It should run at less than 1 sec/step, because that was the speed for the first 2k steps when I started training. And my dataset is pretty small (300 images for training).
In my experience, it is possible that your input images are simply too large. If you take a look at TensorBoard during the training session, you can see that all the reshape calculations are running on the GPU. So you could write a Python script to resize your input images without changing the aspect ratio, and at the same time set your batch size a bit higher (maybe 4 or 8). Then you can train on your dataset faster and still get a relatively good result (mAP).
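A minimal sketch of such a resize script, assuming Pillow is installed and the directory names are placeholders:

import os
from PIL import Image

src_dir, dst_dir = 'images_raw', 'images_resized'        # placeholder paths
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue
    img = Image.open(os.path.join(src_dir, name))
    img.thumbnail((1200, 1200))                          # shrinks in place, keeps the aspect ratio
    img.save(os.path.join(dst_dir, name))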