Does a TensorFlow job use multiple cores by default? - tensorflow

I am running the ImageNet example from the TensorFlow models repository. I've instrumented sess.run as described in a GitHub comment and got the following view in chrome://tracing.
I am wondering whether TF sometimes uses multiple cores, or a single core all the time. I'd think it is using multiple cores when ops can run in parallel, as shown in the red box of the figure. However, all these 6 threads are listed under /job:localhost/replica:0/task:0/cpu:0, which makes me question my interpretation. Does cpu:0 mean all CPU cores?
I am running on a desktop with 8 cores. I ran htop to watch core utilization during the TF run and I see only one core getting saturated at 95-100%.

I found an existing answer to this question: all cores are wrapped in cpu:0, i.e., TensorFlow does indeed use multiple CPU cores by default.
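For completeness, the sizes of the two thread pools involved (intra-op, which parallelizes a single op such as a matmul, and inter-op, which runs independent ops concurrently) can be set explicitly when creating the session. This is a minimal TF 1.x-style sketch; the thread counts shown are purely illustrative (a value of 0, the default, lets TensorFlow choose based on the available cores):

import tensorflow as tf

# Illustrative thread counts; 0 (the default) lets TensorFlow size the pools
# from the number of available cores.
config = tf.ConfigProto(
    intra_op_parallelism_threads=4,  # threads used inside a single op (e.g. a matmul)
    inter_op_parallelism_threads=4,  # threads used to run independent ops concurrently
)

with tf.Session(config=config) as sess:
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    sess.run(tf.matmul(a, b))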

Related

TensorFlow model serving on Google AI Platform online prediction too slow with instance batches

I'm trying to deploy a TensorFlow model to Google AI Platform for Online Prediction. I'm having latency and throughput issues.
The model runs on my machine in less than 1 second (with only an Intel Core I7 4790K CPU) for a single image. I deployed it to AI Platform on a machine with 8 cores and an NVIDIA T4 GPU.
When running the model on AI Platform on the mentioned configuration, it takes a little less than a second when sending only one image. If I start sending many requests, each with one image, the model eventually blocks and stops responding. So I'm instead sending batches of images on each request (from 2 to 10, depending on external factors).
The problem is that I expected the batched requests to be almost constant in time. When sending 1 image, the CPU utilization was around 10% and GPU 12%. So I expected that a batch of 9 images would use ~100% of the hardware and respond in the same time ~1 sec, but this is not the case. A batch of 7 to 10 images takes anywhere from 15 to 50 seconds to be processed.
I have already tried to optimize my model: I was using map_fn and replaced that with manual loops, switched from float32 to float16, and tried to vectorize the operations as much as possible, but it's still in the same situation.
What am I missing here?
I'm using the latest AI Platform runtime for online prediction (Python 3.7, TensorFlow 2.1, CUDA 10.1).
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
Thanks in advance for any help with this. Please ask me for any information that you may need to better understand the issue.
UPDATE 2020-07-13: as suggested in a comment below, I also tried running the model on CPU, but it's really slow and suffers from the same problems as with the GPU. It doesn't seem to process images from a single request in parallel.
Also, I think I'm running into issues with TensorFlow Serving due to the rate and amount of requests. I used the tensorflow/serving:latest-gpu Docker image locally to test this further. The model answers 3 times faster on my machine (GeForce GTX 1650) than on AI Platform, but it's really inconsistent with response times. I'm getting the following response times (<amount of images> <response time in milliseconds>):
3 9004
3 8051
11 4332
1 222
3 4386
3 3547
11 5101
9 3016
10 3122
11 3341
9 4039
11 3783
11 3294
Then, after running for a minute, I start getting delays and errors:
3 27578
3 28563
3 31867
3 18855
{
  message: 'Request failed with status code 504',
  response: {
    data: { error: 'Timed out waiting for notification' },
    status: 504
  }
}
For others with the same problem as me when using AI Platform:
As stated in a comment from the Google Cloud team here, AI Platform does not execute batches of instances at once. They plan on adding the feature, though.
We've since moved on from AI Platform to a custom deployment of NVIDIA's Triton Inference Server hosted on Google Cloud Compute Engine. We're getting much better performance than we expected, and we can still apply many more optimizations to our model provided by Triton.
Thanks to everyone who tried to help by replying to this answer.
From the Google Cloud documentation:
If you use a simple model and a small set of input instances, you'll find that there is a considerable difference between how long it takes to finish identical prediction requests using online versus batch prediction. It might take a batch job several minutes to complete predictions that are returned almost instantly by an online request. This is a side-effect of the different infrastructure used by the two methods of prediction. AI Platform Prediction allocates and initializes resources for a batch prediction job when you send the request. Online prediction is typically ready to process at the time of request.
This has to do, as the quote says, with the difference in node allocation, especially with:
Node allocation for online prediction:
Keeps at least one node ready over a period of several minutes, to handle requests even when there are none to handle. The ready state ensures that the service can serve each prediction promptly.
You can learn more about that here
The model is a large version of YOLOv4 (~250MB in SavedModel format). I've built a few postprocessing algorithms in TensorFlow that operate on the output of the model.
What postprocessing modifications have you made to YOLOv4? Is it possible that those operations are the source of the slowdown? One test you can do to validate this hypothesis locally is to benchmark an unmodified version of YOLOv4 against the benchmarks you've already made for your modified version.
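As a rough illustration of such a benchmark, here is a minimal timing sketch; the SavedModel paths (yolo_v4_vanilla/, yolo_v4_postproc/), the input shape, and the run count are hypothetical placeholders, not details from the question:

import time
import numpy as np
import tensorflow as tf

def benchmark(saved_model_dir, runs=20):
    """Time the serving signature of a SavedModel on a dummy image batch."""
    model = tf.saved_model.load(saved_model_dir)
    infer = model.signatures["serving_default"]
    # Resolve the (single) input name from the signature itself.
    input_name = list(infer.structured_input_signature[1].keys())[0]
    # Hypothetical input shape; adjust to the model's real signature.
    dummy = tf.constant(np.random.rand(1, 416, 416, 3), dtype=tf.float32)
    infer(**{input_name: dummy})  # warm-up (tracing, memory allocation)
    start = time.perf_counter()
    for _ in range(runs):
        infer(**{input_name: dummy})
    return (time.perf_counter() - start) / runs

# Hypothetical paths: unmodified YOLOv4 vs. the version with custom postprocessing.
print("vanilla  :", benchmark("yolo_v4_vanilla/"), "s/inference")
print("postproc :", benchmark("yolo_v4_postproc/"), "s/inference")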
Last but not least, I also tried debugging with TensorBoard, and it turns out that the YOLOv4 part of the TensorFlow Graph is taking ~90% of the processing time. I expected this particular part of the model to be highly parallel.
It would be interesting to take a look at the "debugging output" you're mentioning here. If you use https://www.tensorflow.org/guide/profiler#install_the_profiler_and_gpu_prerequisites, what is the breakdown of the most expensive operations? I've had some experience digging into TF ops -- I've found some strange bottlenecks due to CPU <-> GPU data transfers in some cases. Would be happy to hop on a call sometime and take a look with you if you shoot me a DM.
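For reference, capturing such a per-op breakdown is only a few lines, assuming a TF 2.2+ runtime where tf.profiler.experimental is available (the question mentions 2.1, where the TensorBoard Keras callback with profile_batch plays a similar role). The tiny infer function below is just a stand-in for the real serving call:

import tensorflow as tf

# Trivial stand-in for the real inference call; in practice `infer` would be
# the model's serving signature and `batch` a real image batch.
@tf.function
def infer(x):
    return tf.linalg.matmul(x, x)

batch = tf.random.normal([1, 512, 512])

tf.profiler.experimental.start("profile_logdir")  # requires TF 2.2+
for _ in range(10):
    infer(batch)
tf.profiler.experimental.stop()

# Inspect the per-op breakdown with: tensorboard --logdir profile_logdir ("Profile" tab)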

Low NVIDIA GPU Usage with Keras and Tensorflow

I'm running a CNN with keras-gpu and tensorflow-gpu with an NVIDIA GeForce RTX 2080 Ti on Windows 10. My computer has an Intel Xeon E5-2683 v4 CPU (2.1 GHz). I'm running my code through Jupyter (most recent Anaconda distribution). The output in the command terminal shows that the GPU is being utilized; however, the script I'm running takes longer than I expect to train/test on the data, and when I open the Task Manager it looks like the GPU utilization is very low. Here's an image:
Note that the CPU isn't being utilized and nothing else in the Task Manager suggests anything is being fully utilized. I don't have an Ethernet connection and am connected to Wi-Fi (I don't think this affects anything, but I'm not sure with Jupyter since it runs through the web browser). I'm training on a lot of data (~128 GB), which is all loaded into RAM (512 GB). The model I'm running is a fully convolutional neural network (basically a U-Net architecture) with 566,290 trainable parameters. Things I've tried so far:
1. Increasing batch size from 20 to 10,000 (increases GPU usage from ~3-4% to ~6-7%, greatly decreases training time as expected).
2. Setting use_multiprocessing to True and increasing number of workers in model.fit (no effect).
I followed the installation steps on this website: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187/#look-at-the-job-run-with-tensorboard
Note that this installation specifically DOESN'T install CuDNN or CUDA. I've had trouble in the past with getting tensorflow-gpu running with CUDA (although I haven't tried in over 2 years so maybe it's easier with the latest versions) which is why I used this installation method.
Is this most likely the reason why the GPU isn't being fully utilized (no CuDNN/CUDA)? Does it have something to do with the dedicated GPU memory usage being a bottleneck? Or maybe something to do with the network architecture I'm using (number of parameters, etc.)?
Please let me know if you need any more information about my system or the code/data I'm running on to help diagnose. Thanks in advance!
EDIT: I noticed something interesting in the task manager. An epoch with batch size of 10,000 takes around 200s. For the last ~5s of each epoch, the GPU usage increases to ~15-17% (up from ~6-7% for the first 195s of each epoch). Not sure if this helps or indicates there's a bottleneck somewhere besides the GPU.
You definitely need to install CUDA/cuDNN to fully utilize the GPU with TensorFlow. You can double-check that the packages are installed correctly, and that the GPU is available to TensorFlow/Keras, by running
import tensorflow as tf
tf.config.list_physical_devices("GPU")
The output should look something like [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')] if the device is available.
If you've installed CUDA/cuDNN correctly, then all you need to do is change the graph from "Copy" to "Cuda" in the Task Manager dropdown, which will show the activity of the CUDA cores. The other GPU indicators will not be active when running tf/keras because there is no video encoding/decoding etc. to be done; the work runs on the CUDA cores, so the only way to track GPU usage from the Task Manager is to look at the CUDA utilization.
I would first start by running one of the short "tests" to ensure TensorFlow is utilizing the GPU. For example, I prefer @Salvador Dali's answer in that linked question:
import tensorflow as tf

with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c))
If TensorFlow is indeed using your GPU, you should see the result of the matrix multiplication printed. Otherwise, you will see a fairly long stack trace stating that "gpu:0" cannot be found.
If this all works well, then I would recommend using NVIDIA's nvidia-smi utility. It is available on both Windows and Linux and, AFAIK, is installed with the NVIDIA driver. On a Windows system it is located at
C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe
Open a windows command prompt and navigate to that directory. Then run
nvidia-smi.exe -l 3
This will show you a screen like the one below, which updates every three seconds.
Here we can see various information about the state of the GPUs and what they are doing. Of specific interest in this case is the "Pwr: Usage/Cap" and "Volatile GPU-Util" columns. If your model is indeed using the/a GPU these columns should increase "instantaneously" once you start training the model.
You will most likely see an increase in fan speed and temperature unless you have a very good cooling solution. At the bottom of the printout you should also see a process with a name akin to "python" or "jupyter" running.
If this fails to provide an answer as to the slow training times, then I would surmise the issue lies with the model and code itself. And I think that is actually the case here; specifically, the Windows Task Manager's listing for "Dedicated GPU Memory Usage" is pegged at basically maximum.
If you have tried @KDecker's and @OverLordGoldDragon's solutions and low GPU usage is still there, I would suggest investigating your data pipeline first. The following two figures are from the official TensorFlow guide on data performance; they illustrate well how the data pipeline affects GPU efficiency.
As you can see, preparing data in parallel with training increases GPU usage. In this situation, CPU processing becomes the bottleneck. You need to find a mechanism to hide the latency of preprocessing, such as changing the number of processes, the size of the buffer, etc. The throughput of the CPU side should match that of the GPU; this way, the GPU will be maximally utilized.
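A minimal tf.data sketch of the pattern these figures describe (parallel preprocessing plus prefetching); the file pattern, image size, and batch size are hypothetical placeholders:

import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in TF 2.4+

def preprocess(path):
    """Hypothetical CPU-side preprocessing: decode and resize one image."""
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224]) / 255.0

dataset = (
    tf.data.Dataset.list_files("data/train/*.jpg")   # hypothetical path
    .map(preprocess, num_parallel_calls=AUTOTUNE)    # preprocess on several CPU cores
    .batch(32)
    .prefetch(AUTOTUNE)                              # prepare next batches while the GPU trains
)

# model.fit(dataset, epochs=10)  # the GPU no longer waits on the input pipeline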
Take a look at Tensorpack as well; it has detailed tutorials on how to speed up your input data pipeline.
Everything works as expected; your dedicated memory usage is nearly maxed, and neither TensorFlow nor CUDA can use shared memory -- see this answer.
If your GPU runs OOM, the only remedy is to get a GPU with more dedicated memory, decrease the model size, or use the script below to prevent TensorFlow from assigning redundant resources to the GPU (which it does tend to do):
## LIMIT GPU USAGE
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # don't pre-allocate memory; allocate as needed
config.gpu_options.per_process_gpu_memory_fraction = 0.95  # cap the fraction of memory that may be allocated
K.tensorflow_backend.set_session(tf.Session(config=config))  # create a session with the above settings
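For what it's worth, if you are on TensorFlow 2.x (where ConfigProto and the Keras set_session call no longer apply), a roughly equivalent sketch is:

import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Allocate GPU memory on demand instead of reserving almost all of it upfront.
    tf.config.experimental.set_memory_growth(gpus[0], True)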
The unusual increased usage you observe may be shared memory being accessed temporarily after other available resources are exhausted, especially with use_multiprocessing=True - but I am not sure; there could be other causes.
There seems to have been a change to the installation method you referenced: https://www.pugetsystems.com/labs/hpc/The-Best-Way-to-Install-TensorFlow-with-GPU-Support-on-Windows-10-Without-Installing-CUDA-1187
It is now much easier and should eliminate the problems you are experiencing.
Important Edit: You don't seem to be looking at the actual compute usage of the GPU; look at the attached image:
Read the following two pages; you will get an idea of how to properly set things up with the GPU:
https://medium.com/@kegui/how-do-i-know-i-am-running-keras-model-on-gpu-a9cdcc24f986
https://datascience.stackexchange.com/questions/41956/how-to-make-my-neural-netwok-run-on-gpu-instead-of-cpu

How to determine the optimal number of GPUs for my machine learning script?

The cluster I am using has 4 NVIDIA GPUs (P100) per node. I have TensorFlow code that I need to run. It takes many hours to complete, and I tried to use all 4 GPUs available on the node, but it looks like it runs slower with all 4 GPUs than with only 1 GPU, and I am not sure why... What is the best strategy to determine how many GPUs I should use for my problem?
It is possible that you didn't structure your code optimally for multi-GPU training, for example if you distributed it layer-wise. Generally, training speed should scale roughly linearly with the number of GPUs.
Please refer to this answer for the options you have to adapt your network to multi-GPU training.
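As a point of reference for data-parallel scaling, the usual pattern in current TensorFlow is tf.distribute.MirroredStrategy; the sketch below uses a hypothetical toy model and random data just to show where the strategy scope and the scaled global batch size go:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # uses all visible GPUs by default
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Hypothetical random data; scale the global batch size with the replica count
# so each GPU still receives a full per-device batch.
global_batch = 64 * strategy.num_replicas_in_sync
x = tf.random.normal([10000, 32])
y = tf.random.normal([10000, 1])
model.fit(x, y, batch_size=global_batch, epochs=2)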

Strategies for improving performance when using TensorFlow with C++?

I'm fairly new to TensorFlow and ML in general, and am wondering what strategies I can use to increase the performance of an application I am building.
My app is using the TensorFlow C++ interface, with a source-compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx, optionally adding --config=cuda) for either AVX or AVX + CUDA, on Mac OS X 10.12.1, on a MacBook Pro (2.8 GHz Intel Core i7, 2 cores 8 threads, 16 GB RAM, NVIDIA 750M with 2 GB VRAM).
My application is using Inception V3 model and pulling feature vectors from pool_3 layer. I'm decoding video frames via native API's and passing those in memory buffers to the C++ interface for TF and running them into a session.
I'm not currently batching, but I am caching my session and re-using it for each individual decoded frame / tensor submission. I've noticed that CPU and GPU performance are about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. I've confirmed CUDA is being invoked and loaded, and that the GPU is functioning (or appears to be).
Some questions:
In general, what should I expect as reasonable time-wise performance for TF running a frame through Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3, I imagine I am spending more time waiting on PCI transfers than waiting for meaningful work from the GPU?
If so, is there a good example of batching under C++ for Inception V3?
Are there operations that cause additional CPU->GPU synchronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources? Can I use nested scopes somehow in this manner? I couldn't quite get that to work but likely missed something.
Any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer
Answering your last question first, if there's documentation about performance optimization, yes:
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds, roughly 5 fps) don't seem crazy on a laptop platform, using the 2016 version of TensorFlow with Inception. With some of the performance improvements outlined in the performance guide above, that should probably be doubled in late 2017.
For batching, yes - the newer example inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.
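Although the question is about the C++ API, the effect of batching can be estimated quickly from Python, using a modern tf.keras Inception V3 with random weights as a stand-in for the real frozen graph (batch sizes here are hypothetical); comparing the per-frame cost at batch size 32 versus 1 gives a rough sense of what batching buys on your hardware:

import time
import tensorflow as tf

# Randomly initialized Inception V3 (pool_3-style pooled features) as a stand-in.
model = tf.keras.applications.InceptionV3(weights=None, include_top=False, pooling="avg")

def per_frame_ms(batch_size, runs=10):
    images = tf.random.normal([batch_size, 299, 299, 3])
    model(images)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        model(images)
    return 1000 * (time.perf_counter() - start) / (runs * batch_size)

print("batch  1:", per_frame_ms(1), "ms/frame")
print("batch 32:", per_frame_ms(32), "ms/frame")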

TensorFlow and Python multiprocessing

I wrote the following piece of code to evaluate the effect of Python multiprocessing while using TensorFlow:
import tensorflow as tf
from multiprocessing import Process

mydevice = "/gpu:0"
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.01)
mrange = 1000

def myfun():
    with tf.device(mydevice):
        mm1 = tf.constant([[float(i) for i in range(mrange)]], dtype='float32')
        mm2 = tf.constant([[float(i)] for i in range(mrange)], dtype='float32')
    with tf.device(mydevice):
        prod = tf.matmul(mm1, mm2)
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))
    rest = sess.run(prod)
    print(rest)
    sess.close()

ll = []
for i in range(100):
    p1 = Process(target=myfun)
    p1.start()
    ll.append(p1)
for item in ll:
    item.join()
Time taken to run this code on my laptop's GPU: ~6 seconds
If I change the device to CPU: ~6 seconds
If I remove multiprocessing, and call the function serially: 75 seconds
Could someone please explain what exactly happens if I use multiprocessing while the device is set to GPU? It is clear that multiple CUDA kernels will be launched, but will they run concurrently on the GPU?
This is just an experiment to see if I can launch multiple RNNs onto the GPU.
GPUs are mainly designed to render 2D and 3D computer graphics. This involves a lot of number crunching which can benefit from parallel algorithms. Deep learning also involves a lot of parallel number crunching so that the same hardware which accelerates graphics can also accelerate deep learning.
What makes a GPU different from a CPU is that it is optimized for highly parallel number crunching. Look at the specs for any NVIDIA GPU and you will see a metric called CUDA Cores. This number is usually somewhere in the thousands (or hundreds for weaker GPUs). A single CUDA core is a lot weaker than a standard CPU core, but since you have so many of them, a GPU can greatly outperform a CPU on parallel tasks. The architecture is actually pretty complex, which you can read about if you get into CUDA programming; take a look at this article: https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units
From the numbers you posted, I am guessing you have a weak laptop GPU, so that is why it performs about the same as the CPU. On my desktop I have the new GTX 1080 and it can beat my CPU by more than 20x. I am surprised that your numbers go up so much when you call it serially, but I think there is something else going on there, since I am not even sure how you would do that with TensorFlow.
Fermi and later GPUs support concurrent kernel execution via CUDA streams, which is used by TensorFlow. Therefore, independent ops will run in parallel even if they are in the same graph, launched by a single sess.run call on a single thread, as long as the CUDA runtime thinks it is beneficial to do so.
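A small TF 1.x-style sketch of what "independent ops in the same graph" means here: the two matmuls below share no data dependency, so a single sess.run on one Python thread may still overlap them on the GPU via separate CUDA streams (a GPU is assumed to be present for the explicit /gpu:0 placement):

import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.random_normal([2000, 2000])
    b = tf.random_normal([2000, 2000])
    x = tf.matmul(a, a)  # independent of y
    y = tf.matmul(b, b)  # independent of x

with tf.Session() as sess:
    # One run call, one Python thread; TensorFlow may still overlap the two
    # matmuls on the GPU through separate CUDA streams when it judges it worthwhile.
    sess.run([x, y])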