Performance of tf inside jupyter notebooks vs. command line - tensorflow

I am noticing quite significant performance (speed) differences when running tensorflow code from inside a jupyter notebook, versus running it as a script from the command line.
For example, below are the results of running the MNIST CNN tutorial (https://www.tensorflow.org/code/tensorflow/examples/tutorials/mnist/fully_connected_feed.py)
Setup:
AWS instance with 32 Xeon-CPUS, 62GB memory, 4 K520 GPUS (4GB mem)
Linux: 3.13.0-79 Ubuntu
Tensorflow: 0.10.0rc0 (built from sources with GPU support)
Python: 3.5.2 |Anaconda custom (64-bit)|
CUDA libraries : libcublas.so.7.5 , libcudnn.so.5, libcufft.so.7.5, libcuda.so.1, libcurand.so.7.5
Training steps: 2000
Jupyter notebook execution time:
This does not include time for imports and loading the dataset - just the training phase
CPU times: user 8min 58s, sys: 0 ns, total: 8min 58s
Wall time: 8min 20s
Command line execution:
This is the time for execution of the full script.
real 0m18.803s
user 0m11.326s
sys 0m13.200s
The GPU is being used in both cases, but utilization is higher (typically 35% during training phase for the command-line vs 2-3% for the notebook version). I even tried manually placing it on different GPUs but that doesn't make a big difference to the notebook execution time.
Any ideas / suggestions about why this might be ?
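The two timings above cover different things: the notebook figure covers only the training phase, while the command-line figure covers the whole script. One way to compare like with like is to time the same phase with the same helper in both environments. The following is a minimal sketch, not the tutorial's own code; `time_phase` and `fake_training` are hypothetical names, and `fake_training` is only a stand-in for the real training loop.

```python
import time


def time_phase(fn, *args, **kwargs):
    """Run fn once and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # user + sys CPU time of this process
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return result, wall_seconds, cpu_seconds


def fake_training(steps):
    # Hypothetical stand-in for the real training loop.
    total = 0
    for step in range(steps):
        total += step * step
    return total


if __name__ == "__main__":
    _, wall, cpu = time_phase(fake_training, 2000)
    print(f"wall {wall:.3f}s, cpu {cpu:.3f}s")
```

Calling the same helper around the training loop in the notebook and in the script gives numbers that can be compared directly.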

I am seeing the reverse: GPU utilisation is better in the notebook than on the command line.
I have been training a DQN on Pong. When run from the command line the frame rate falls to 17 fps, while in the notebook it falls to 100 fps.
nvidia-smi shows 294MB of GPU memory in use for the command-line run and 984MB for the Jupyter notebook run.
I don't know the reason, but I see a similar pattern in Colab as well.

Related

Can I clear up GPU VRAM in Colab

I'm trying to use aitextgen to fine-tune the 774M GPT-2 model on a dataset. Unfortunately, no matter what I do, training fails because only 80 MB of VRAM is available. How can I clear the VRAM without restarting the runtime, and ideally prevent it from filling up?
Another option is to use the following code snippets.
1.
!pip install numba
Then:
from numba import cuda
# all of your code and execution
cuda.select_device(0)
cuda.close()
Your problem is discussed on the official TensorFlow GitHub: https://github.com/tensorflow/tensorflow/issues/36465
Update: @alchemy reported that this is unrecoverable, i.e. the GPU cannot be turned back on afterwards.
You can try the code below.
device = cuda.get_current_device()
device.reset()
Run the command !nvidia-smi inside a notebook cell.
Look for the process ID of the GPU process you no longer need, then run the command !kill process_id to free its VRAM.
It should help you.
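The lookup step above can also be scripted. This is a minimal sketch, assuming nvidia-smi is on the PATH and that its CSV output has one "pid, used_memory" pair per line; `parse_compute_apps` and `list_gpu_processes` are hypothetical helper names. Whether any processes are listed depends on the environment.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV lines into (pid, MiB or None) tuples."""
    apps = []
    for line in csv_text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # skip anything that is not a 'pid, memory' row
        # Memory may be reported as a non-numeric placeholder; keep None then.
        used = int(fields[1]) if fields[1].isdigit() else None
        apps.append((int(fields[0]), used))
    return apps


def list_gpu_processes():
    """Ask nvidia-smi for the compute processes currently holding GPU memory."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    for pid, used_mib in list_gpu_processes():
        print(pid, used_mib)
```

The PIDs it returns are the ones to pass to !kill. Killing the process that runs the notebook kernel itself ends the session.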

Pytorch can move tensor to gpu, but nvidia-smi shows no GPU memory in use

Hello, I am very confused by this situation.
First, both my TF and PyTorch installations can detect my GPU (checked with torch.cuda.is_available()),
but my model, which ran fine on GPUs just a few days ago, can only run on CPUs today.
It seems PyTorch and TF silently skip moving the model to the GPU.
Second, I tested in Python interactive mode with:
import torch
x = torch.randn(10000,1000).cuda()
This line worked fine, and when I type
x.device
Python shows me that x is on GPU device index 0,
but at the same time
nvidia-smi shows NO GPU MEMORY in use.
Third, when I monitor my GPU state with
watch -n 1 nvidia-smi
I found that the temperature and power draw of my GPUs do not vary at all for a long while.
Any help will be appreciated!

Low NVIDIA GPU Usage with Keras and Tensorflow

I'm running a CNN with keras-gpu and tensorflow-gpu on an NVIDIA GeForce RTX 2080 Ti under Windows 10. My computer has an Intel Xeon E5-2683 v4 CPU (2.1 GHz). I'm running my code through Jupyter (most recent Anaconda distribution). The output in the command terminal shows that the GPU is being used, but the script takes longer than I expect to train/test on the data, and when I open Task Manager the GPU utilization looks very low. Here's an image:
Note that the CPU isn't being fully utilized either, and nothing else in Task Manager suggests that anything is maxed out. I don't have an Ethernet connection and am connected over Wi-Fi (I don't think this affects anything, but I'm not sure since Jupyter runs through the web browser). I'm training on a lot of data (~128GB), which is all loaded into RAM (512GB). The model is a fully convolutional neural network (basically a U-Net architecture) with 566,290 trainable parameters. Things I have tried so far:
1. Increasing batch size from 20 to 10,000 (increases GPU usage from ~3-4% to ~6-7%, greatly decreases training time as expected).
2. Setting use_multiprocessing to True and increasing number of workers in model.fit (no effect).
I followed the installation steps on this website: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/#look-at-the-job-run-with-tensorboard
Note that this installation specifically DOESN'T install CuDNN or CUDA. I've had trouble in the past with getting tensorflow-gpu running with CUDA (although I haven't tried in over 2 years so maybe it's easier with the latest versions) which is why I used this installation method.
Is this most likely the reason why the GPU isn't being fully utilized (no CuDNN/CUDA)? Does it have something to do with the dedicated GPU memory usage being a bottleneck? Or maybe something to do with the network architecture I'm using (number of parameters, etc.)?
Please let me know if you need any more information about my system or the code/data I'm running on to help diagnose. Thanks in advance!
EDIT: I noticed something interesting in the task manager. An epoch with batch size of 10,000 takes around 200s. For the last ~5s of each epoch, the GPU usage increases to ~15-17% (up from ~6-7% for the first 195s of each epoch). Not sure if this helps or indicates there's a bottleneck somewhere besides the GPU.
You definitely need CUDA/cuDNN installed to fully utilize the GPU with TensorFlow. You can double-check that the packages are installed correctly and that the GPU is available to TensorFlow/Keras by using
import tensorflow as tf
tf.config.list_physical_devices("GPU")
and the output should look something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
if the device is available.
If you've installed CUDA/cuDNN correctly, all you need to do is change Copy to Cuda in the dropdown menu in Task Manager, which will show the activity of the CUDA cores. The other GPU indicators will not be active when running TF/Keras because there is no video encoding/decoding etc. to be done; the work runs on the CUDA cores only, so the only way to track GPU usage from Task Manager is to look at the CUDA utilization.
I would first start by running one of the short "tests" to ensure TensorFlow is utilizing the GPU. For example, I prefer @Salvador Dali's answer in that linked question:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))
If TensorFlow is indeed using your GPU, you should see the result of the matrix multiplication printed. Otherwise you will get a fairly long stack trace stating that "gpu:0" cannot be found.
If this all works well, then I would recommend using NVIDIA's nvidia-smi utility. It is available on both Windows and Linux and, AFAIK, installs with the NVIDIA driver. On a Windows system it is located at
C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
Open a windows command prompt and navigate to that directory. Then run
nvidia-smi.exe -l 3
This will show you a screen like so, that updates every three seconds.
Here we can see various information about the state of the GPUs and what they are doing. Of specific interest in this case are the "Pwr: Usage/Cap" and "Volatile GPU-Util" columns. If your model is indeed using the/a GPU, these columns should increase "instantaneously" once you start training the model.
You will most likely also see an increase in fan speed and temperature, unless you have a very nice cooling solution. At the bottom of the printout you should also see a process with a name akin to "python" or "Jupyter" running.
If this fails to provide an answer as to the slow training times, then I would surmise the issue lies with the model and code itself. And I think that is actually the case here: the Windows Task Manager listing for "Dedicated GPU Memory Usage" is pegged at basically maximum.
If you have tried @KDecker's and @OverLordGoldDragon's solutions and GPU usage is still low, I would suggest first investigating your data pipeline. The following two figures, from the data performance guide in the official TensorFlow documentation, illustrate well how the data pipeline affects GPU efficiency.
As you can see, preparing data in parallel with training increases GPU usage. In this situation, CPU processing is the bottleneck. You need a mechanism to hide the latency of preprocessing, such as changing the number of worker processes, the buffer size, etc. The throughput of the CPU should match that of the GPU. That way, the GPU will be maximally utilized.
Take a look at Tensorpack; it has detailed tutorials on how to speed up your input data pipeline.
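The overlap of preprocessing and training described above can be illustrated without any framework. The sketch below is an assumption-laden stand-in, not TensorFlow's or Tensorpack's implementation: a background thread fills a bounded buffer while the consumer (the training loop) drains it. `prefetch` and `preprocess` are hypothetical names. In TensorFlow itself the corresponding mechanism is the `prefetch` transformation of `tf.data.Dataset`.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(batches, buffer_size=2):
    """Yield items from `batches`, producing them on a background thread."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        try:
            for batch in batches:
                buf.put(batch)  # blocks while the buffer is full
        finally:
            buf.put(_DONE)      # always signal completion to the consumer

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()        # blocks until the next batch is ready
        if item is _DONE:
            return
        yield item


def preprocess(n):
    # Hypothetical CPU-side preprocessing that yields one batch at a time.
    for i in range(n):
        yield [i, i + 1]


if __name__ == "__main__":
    for batch in prefetch(preprocess(3)):
        print(batch)  # the "training step" would consume the batch here
```

With plain Python threads the overlap only helps when the preprocessing releases the interpreter lock (file I/O, NumPy, and similar), which is one reason real pipelines rely on worker processes or framework-level prefetching.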
Everything works as expected; your dedicated memory usage is nearly maxed, and neither TensorFlow nor CUDA can use shared memory -- see this answer.
If your GPU runs OOM, the only remedies are to get a GPU with more dedicated memory, decrease the model size, or use the script below to prevent TensorFlow from assigning redundant resources to the GPU (which it does tend to do):
## LIMIT GPU USAGE
import tensorflow as tf           # TF 1.x API (tf.ConfigProto, tf.Session)
from keras import backend as K    # assumed: K is the standalone Keras backend

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # don't pre-allocate memory; allocate as-needed
config.gpu_options.per_process_gpu_memory_fraction = 0.95  # limit memory to be allocated
K.tensorflow_backend.set_session(tf.Session(config=config))  # create sess w/ above settings
The unusual increase in usage you observe may be shared memory resources being temporarily accessed after other available resources are exhausted, especially with use_multiprocessing=True, but I am unsure; there could be other causes.
There seems to have been a change to the installation method you referenced : https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187
It is now much easier and should eliminate the problems you are experiencing.
Important edit: you don't seem to be looking at the actual compute load of the GPU. Look at the attached image:
Read the following two pages; they will give you an idea of how to set things up properly with the GPU.
https://medium.com/@kegui/how-do-i-know-i-am-running-keras-model-on-gpu-a9cdcc24f986
https://datascience.stackexchange.com/questions/41956/how-to-make-my-neural-netwok-run-on-gpu-instead-of-cpu

Object detection training becomes slower over time and uses more CPU than GPU as training progresses

System information
What is the top-level directory of the model you are using: research/object_detection
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes (just VGG-16 implementation for Faster RCNN)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
TensorFlow version (use command below): 1.4.0
CUDA/cuDNN version: 8 and 6
GPU model and memory: NVIDIA-1060 6GB
I am trying to train a Faster R-CNN with VGG-16 as the feature extractor (paper) on my custom dataset using the API.
The training parameters are the same as described in the paper, except that I am running for only 15k steps and resizing the images to 1200x1200 with a batch size of 1.
Training runs fine, but it becomes slower as time progresses. It keeps shifting between CPU and GPU.
The steps that take around 1 s run on the GPU, and the ones with high numbers like ~20 s run on the CPU; I cross-checked this using 'top' and 'nvidia-smi'. Why does it shift between CPU and GPU in the middle of training? I can understand the shift when the model and logs are being saved, but otherwise I don't understand why.
PS: I am running only the train script. I am not running the eval script.
Update:
This becomes worse over time.
The seconds per step keep increasing, which also slows the rate at which checkpoints and logs are stored.
It should run at less than 1 s/step, because that was the speed for the first 2k steps after I started training. And my dataset is pretty small (300 images for training).
In my experience, your input images may simply be too large. If you look at TensorBoard during the training session, you will find that all the reshape computations run on the GPU. So you could write a Python script to resize your input images without changing their aspect ratio, and at the same time raise your batch size a little (maybe to 4 or 8). You can then train on your dataset faster and still get a relatively good result (mAP).
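The aspect-ratio-preserving resize suggested above comes down to one scale factor applied to both sides. Here is a minimal sketch of that calculation only; `fit_within` is a hypothetical helper, and the actual pixel resampling is assumed to be done afterwards with an imaging library of your choice (for example Pillow's `Image.resize`).

```python
def fit_within(width, height, max_side):
    """Return (new_width, new_height) with the longer side at most max_side."""
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longer = max(width, height)
    if longer <= max_side:
        return width, height          # already small enough; never upscale
    scale = max_side / longer         # one factor for both sides keeps the ratio
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    print(fit_within(1600, 1200, 800))
```

If the images are resized offline like this, the bounding-box annotations have to be scaled by the same factor.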

Does TensorFlow job use multiple cores by default?

I am running the ImageNet model from the TensorFlow models repository. I've instrumented sess.run as described in a GitHub comment and got the following view in chrome://tracing.
I am wondering whether TF sometimes uses multiple cores or a single core all the time. I'd think it uses multiple cores when ops can run in parallel, as shown in the red box of the figure. However, all 6 of these threads are listed under /job:localhost/replica:0/task:0/cpu:0, which makes me question my interpretation. Does cpu:0 mean all CPU cores?
I am running on a desktop with 8 cores. I ran htop to watch core utilization during the TF run and saw only one core saturated at 95-100%.
I found an existing answer to this question. All cores are wrapped in cpu:0, i.e., TensorFlow does indeed use multiple CPU cores by default.