Parameter server with CPU only - tensorflow

I'm using TensorFlow to do distributed training. Since the parameter server usually does not use the GPU, can we run the parameter server on CPU-only machines (i.e. with no GPU), or disable the GPU with something like CUDA_VISIBLE_DEVICES=''? I tried this with the TF Object Detection API, but it reports that no GPU device is available. I then tried per_process_gpu_memory_fraction=0.0001, which works, but that is not a good solution, as it still takes around 400 MB of GPU memory without using it. Can someone suggest something for this issue?
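One possible approach (assuming the TF 1.x API the question uses) is to tell TensorFlow not to create any GPU devices for the parameter-server process, rather than shrinking the memory fraction; how you feed the config into your PS job depends on your setup (e.g. tf.estimator.RunConfig(session_config=...)). A minimal sketch:
import tensorflow as tf  # TF 1.x, matching the question

# Tell TensorFlow not to create any GPU devices at all for this process;
# unlike per_process_gpu_memory_fraction, no GPU memory is reserved.
config = tf.ConfigProto(device_count={"GPU": 0})

with tf.Session(config=config) as sess:
    ...  # the parameter-server job now runs on CPU only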

Related

My tensorflow defaults to using my GPU instead of CPU, which is like 10 times slower. How do I fix this and make it use the CPU?

I have one of those gaming laptops with both an integrated GPU and a dedicated GPU (NVIDIA GeForce RTX 3070).
I was getting very slow training speeds for neural networks in TensorFlow, many times slower than on another laptop with vastly inferior CPU and GPU specs.
I think the reason for this slowness is that TensorFlow is running on the dedicated GPU, because when I disable the dedicated GPU the training speeds up by roughly 10 times. That is a huge difference, an order of magnitude.
I know the kernel is running on the dedicated GPU by default because when I disable the dedicated GPU in the middle of the session, the kernel dies.
Therefore, I think disabling the dedicated GPU has forced it to run on the CPU (AMD Ryzen 9 5900HX), which should be better.
I'm running this on Anaconda using Jupyter Notebook.
How do I force it to use my CPU instead of my GPU?
Edit: This seems to be a complicated issue. Some more information.
With dedicated GPU disabled, when training, according to the task manager the GPU usage is 0% (as expected) and the CPU usage is 40%.
But with dedicated GPU enabled, when training, GPU usage is about 10% and CPU usage is about 20%. This is 10 times slower than the above. Why is it using both, but less CPU?
With dedicated GPU enabled (i.e. the normal situation), according to the task manager, scikit-learn uses the CPU not the GPU. So this problem is specific to tensorflow.
Killing the dedicated GPU in the middle of the session crashes not only the kernel but also the ability to open Jupyter Notebook at all.
Forcing Anaconda and Jupyter Notebook to use the integrated GPU instead of the dedicated GPU in the Windows Settings doesn't fix the problem. It's still using the dedicated GPU.
Just tell tensorflow to do so:
with tf.device("/CPU:0"):  # device name might vary
    model.fit(...)
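Alternatively (in TF 2.x) you can hide the GPU from TensorFlow entirely, so nothing can be placed on it; a minimal sketch:
import tensorflow as tf

# Hide all GPUs from TensorFlow; must run before any tensors,
# models or layers are created in the process.
tf.config.set_visible_devices([], "GPU")

print(tf.config.get_visible_devices())  # should now list only CPU devices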

Since TensorflowJS can use the GPU via WebGL, why would I need an nVIDIA GPU?

So TensorFlowJS can use WebGL to do GPU computations and train deep learning models. Why isn't this more popular than using CUDA with an nVIDIA GPU? Most people just trying to prototype machine learning models would love to do so on their personal computer, but many of us resort to using expensive cloud services like AWS (although more recently Google Colab helps) for ML training if we don't have a computer with an nVIDIA GPU. I'm sure nVIDIA GPUs are faster than whatever GPU is in my MacBook, but probably any GPU will offer at least an order-of-magnitude speedup over even a fast CPU and allow for model prototyping, so why aren't we all using WebGL GPGPU? There must be a catch I just don't know about.
The WebGL backend uses the GLSL language to define functions and uploads data as shaders - it "works", but you pay a huge cost to compile the GLSL and upload the shaders: warmup time for semi-complex models is immense (we're talking about minutes just to start up). And then the memory overhead is 100-200% of what the model would normally need - and for larger models you're GPU-memory bound, so you don't want to waste that.
By the way, actual inference time once the model is warmed up and fits in memory is OK using WebGL.
On the other hand, nVidia's CUDA libraries provide direct access to the GPU, so TF compiled to use them is always going to be much more efficient.
Unfortunately, not many GPU vendors provide libraries like CUDA, so most ML is done on nVidia GPUs.
Then there is the next level, when you're using a TPU instead of a GPU - there is no WebGL to start with there.
If I select WebGPU with the TFJS benchmark (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) it responds with "WebGPU is not supported. Please use Chrome Canary browser with flag "--enable-unsafe-webgpu" enabled...."
So when that's ready will it be competitive with CUDA? On my laptop it is about 15% faster than WebGL on that benchmark.

Low NVIDIA GPU Usage with Keras and Tensorflow

I'm running a CNN with keras-gpu and tensorflow-gpu with an NVIDIA GeForce RTX 2080 Ti on Windows 10. My computer has an Intel Xeon E5-2683 v4 CPU (2.1 GHz). I'm running my code through Jupyter (most recent Anaconda distribution). The output in the command terminal shows that the GPU is being utilized, however the script I'm running takes longer than I expect to train/test on the data, and when I open the task manager it looks like the GPU utilization is very low. Here's an image:
Note that the CPU isn't being utilized and nothing else on the task manager suggests anything is being fully utilized. I don't have an ethernet connection and am connected to WiFi (I don't think this affects anything, but I'm not sure with Jupyter since it runs through the web browser). I'm training on a lot of data (~128 GB) which is all loaded into RAM (512 GB). The model I'm running is a fully convolutional neural network (basically a U-Net architecture) with 566,290 trainable parameters. Things I tried so far:
1. Increasing batch size from 20 to 10,000 (increases GPU usage from ~3-4% to ~6-7%, greatly decreases training time as expected).
2. Setting use_multiprocessing to True and increasing number of workers in model.fit (no effect).
I followed the installation steps on this website: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/#look-at-the-job-run-with-tensorboard
Note that this installation specifically DOESN'T install CuDNN or CUDA. I've had trouble in the past with getting tensorflow-gpu running with CUDA (although I haven't tried in over 2 years so maybe it's easier with the latest versions) which is why I used this installation method.
Is this most likely the reason why the GPU isn't being fully utilized (no CuDNN/CUDA)? Does it have something to do with the dedicated GPU memory usage being a bottleneck? Or maybe something to do with the network architecture I'm using (number of parameters, etc.)?
Please let me know if you need any more information about my system or the code/data I'm running on to help diagnose. Thanks in advance!
EDIT: I noticed something interesting in the task manager. An epoch with batch size of 10,000 takes around 200s. For the last ~5s of each epoch, the GPU usage increases to ~15-17% (up from ~6-7% for the first 195s of each epoch). Not sure if this helps or indicates there's a bottleneck somewhere besides the GPU.
You definitely need to install CUDA/cuDNN to fully utilize the GPU with tensorflow. You can double-check that the packages are installed correctly and that the GPU is available to tensorflow/keras by using
import tensorflow as tf
tf.config.list_physical_devices("GPU")
and the output should look something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
if the device is available.
If you've installed CUDA/cuDNN correctly then all you need to do is change "Copy" --> "Cuda" in the drop-down menu in the Task Manager, which will show the activity of the CUDA cores. The other indicators for the GPU will not be active when running tf/keras because there is no video encoding/decoding etc. to be done; it is simply using the CUDA cores on the GPU, so the only way to track GPU usage is to look at the CUDA utilization (when monitoring from the Task Manager).
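As a further sanity check (assuming TF 2.3 or newer), you can also ask TensorFlow which CUDA/cuDNN versions it was built against; a small sketch:
import tensorflow as tf

# Reports whether this TensorFlow build is CUDA-enabled and which
# CUDA/cuDNN versions it was compiled against.
build = tf.sysconfig.get_build_info()
print(build.get("is_cuda_build"))
print(build.get("cuda_version"), build.get("cudnn_version"))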
I would first start by running one of the short "tests" to ensure TensorFlow is utilizing the GPU. For example, I prefer @Salvador Dali's answer in that linked question:
import tensorflow as tf
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
with tf.Session() as sess:
    print(sess.run(c))
If TensorFlow is indeed using your GPU you should see the result of the matrix multiplication printed. Otherwise you will see a fairly long stack trace stating that "gpu:0" cannot be found.
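(If you're on TensorFlow 2.x, where tf.Session no longer exists, a roughly equivalent check is to turn on device placement logging; a minimal sketch:)
import tensorflow as tf

# Log which device each op runs on; the matmul should be reported
# on /device:GPU:0 if a GPU is actually being used.
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(tf.matmul(a, b))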
If this all works well then I would recommend utilizing NVIDIA's nvidia-smi utility. It is available on both Windows and Linux and, AFAIK, installs with the NVIDIA driver. On a Windows system it is located at
C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
Open a windows command prompt and navigate to that directory. Then run
nvidia-smi.exe -l 3
This will show you a screen like so, that updates every three seconds.
Here we can see various information about the state of the GPUs and what they are doing. Of specific interest in this case is the "Pwr: Usage/Cap" and "Volatile GPU-Util" columns. If your model is indeed using the/a GPU these columns should increase "instantaneously" once you start training the model.
You most likely will see an increase in fan speed and temperature unless you have a very nice cooling solution. At the bottom of the printout you should also see a process with a name akin to "python" or "jupyter" running.
If this fails to provide an answer as to the slow training times, then I would surmise the issue lies with the model and code itself. And I think that is actually the case here: specifically, the Windows Task Manager's listing for "Dedicated GPU Memory Usage" is pegged at basically maximum.
If you have tried @KDecker's and @OverLordGoldDragon's solutions and low GPU usage is still there, I would suggest first investigating your data pipeline. The following two figures are from the official TensorFlow data performance guide; they illustrate well how the data pipeline affects GPU efficiency.
As you can see, preparing data in parallel with the training will increase GPU usage. In this situation, CPU preprocessing becomes the bottleneck. You need to find a mechanism to hide the latency of preprocessing, such as changing the number of parallel processes, the buffer size, etc., so that the throughput of the CPU matches that of the GPU; that way the GPU is maximally utilized (see the sketch below).
Take a look at Tensorpack; it has detailed tutorials on how to speed up your input data pipeline.
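A minimal tf.data sketch of the overlap the guide describes (the dataset source and parse_fn below are placeholders, not from the original post):
import tensorflow as tf

def parse_fn(example):
    # placeholder preprocessing; replace with your own decoding/augmentation
    return example

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.range(10000))
    .map(parse_fn, num_parallel_calls=tf.data.AUTOTUNE)  # parallel CPU preprocessing
    .batch(256)
    .prefetch(tf.data.AUTOTUNE)  # overlap preprocessing with GPU training
)

# model.fit(dataset, ...)  # the pipeline can be fed directly to Keras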
Everything works as expected; your dedicated memory usage is nearly maxed, and neither TensorFlow nor CUDA can use shared memory -- see this answer.
If your GPU runs OOM, the only remedy is to get a GPU with more dedicated memory, or decrease the model size, or use the script below to prevent TensorFlow from assigning redundant resources to the GPU (which it does tend to do):
## LIMIT GPU USAGE (TF 1.x with standalone Keras)
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # don't pre-allocate memory; allocate as needed
config.gpu_options.per_process_gpu_memory_fraction = 0.95  # cap the fraction of GPU memory TF may allocate
K.tensorflow_backend.set_session(tf.Session(config=config))  # create a session with the above settings
The unusual increased usage you observe may be shared memory resources being temporarily accessed due to exhausting other available resources, especially with use_multiprocessing=True - but I'm not sure; it could have other causes.
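(For TF 2.x, where ConfigProto and Session are gone, a rough equivalent of the script above is the following sketch:)
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # allocate GPU memory on demand instead of grabbing it all upfront
    tf.config.experimental.set_memory_growth(gpus[0], True)
    # or, alternatively, put a hard cap on how much the process may allocate (in MB):
    # tf.config.set_logical_device_configuration(
    #     gpus[0], [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])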
There seems to have been a change to the installation method you referenced: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187
It is now much easier and should eliminate the problems you are experiencing.
Important Edit: You don't seem to be looking at the actual compute usage of the GPU; look at the attached image:
Read the following two pages; they will give you an idea of how to properly set up the GPU:
https://medium.com/@kegui/how-do-i-know-i-am-running-keras-model-on-gpu-a9cdcc24f986
https://datascience.stackexchange.com/questions/41956/how-to-make-my-neural-netwok-run-on-gpu-instead-of-cpu

By default, does TensorFlow use GPU/CPU simultaneously for computing or GPU only?

By default, TensorFlow will use the available GPU devices. That said, does TensorFlow use GPUs and CPUs simultaneously for computing, or GPUs for computing and CPUs for job handling (either way, the CPUs are always active, I think)?
Generally it uses both, the CPU and the GPU (assuming you are using a GPU-enabled TensorFlow). What actually gets used depends on the actual operations that your code is using.
For each operation available in TensorFlow, there are several "implementations" of such operation, generally a CPU implementation and a GPU one. Some operations only have CPU implementations as it makes no sense for a GPU implementation, but overall most operations are available for both devices.
If you make custom operations then you need to provide implementations that you want.
TensorFlow operations come packaged with a list of devices they can execute on and a list of associated priorities.
For example, a convolution is very conducive to computation on a GPU but can still be done on a CPU, whereas scalar additions should definitely be done on a CPU. You can override this selection using tf.device and the name attached to the device of interest, as shown below.
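A minimal TF 2.x sketch of overriding placement with tf.device (assuming a machine with at least one GPU):
import tensorflow as tf

# Pin ops to specific devices by name.
with tf.device("/CPU:0"):
    a = tf.random.uniform([1000, 1000])

with tf.device("/GPU:0"):  # raises an error if no GPU exists and soft placement is off
    b = tf.matmul(a, a)

print(b.device)  # e.g. /job:localhost/replica:0/task:0/device:GPU:0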
Someone correct me if I'm wrong.
But from what I'm aware of, TensorFlow only uses either the GPU or the CPU, depending on which installation you ran. For example, if you used pip install tensorflow for Python 2 or python3 -m pip install tensorflow for Python 3, you'll only get the CPU version.
Vice versa for the GPU.
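A quick way to check which build you ended up with (tf.test.is_built_with_cuda exists in both 1.x and 2.x; the device listing call below is the 2.x spelling):
import tensorflow as tf

print(tf.test.is_built_with_cuda())            # True for a CUDA-enabled build
print(tf.config.list_physical_devices("GPU"))  # [] if no usable GPU is visible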
If you still have any questions or if this did not correctly answer your question feel free to ask me more.

TensorFlow GPU memory

I have a very large deep neural network. When I try to run it on the GPU I get "OOM when allocating", but when I mask the GPU and run on the CPU it works (about 100x slower, comparing a small model).
My question is whether there is any mechanism in TensorFlow that would enable me to run the model on the GPU. I assume the CPU uses virtual memory, so it can allocate as much as it likes and move data between cache/RAM/disk (thrashing).
Is there something similar in TensorFlow with the GPU? That would help me even if it were 10x slower than a regular GPU run.
Thanks
GPU memory is currently not extensible (until something like Pascal is available).
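(Related to the Pascal remark: on Linux with a Pascal-or-newer GPU, TF 1.x can oversubscribe GPU memory through CUDA unified memory by setting the per-process memory fraction above 1.0, using host RAM as swap space; a hedged sketch, and expect it to be far slower than fitting the model in GPU memory:)
import tensorflow as tf  # TF 1.x

config = tf.ConfigProto()
# Values > 1.0 enable CUDA unified memory oversubscription
# (Pascal-or-newer GPU, Linux only); host RAM acts as swap space.
config.gpu_options.per_process_gpu_memory_fraction = 2.0

with tf.Session(config=config) as sess:
    ...  # build and run the large model here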