Strategies for improving performance when using TensorFlow with C++? - optimization

I'm fairly new to TensorFlow and to ML in general, and I'm wondering what strategies I can use to increase the performance of an application I am building.
My app uses the TensorFlow C++ interface, with a source-compiled TF 0.11 libtensorflow_cc.so (built with bazel build -c opt --copt=-mavx and optionally adding --config=cuda) for either AVX or AVX + CUDA on Mac OS X 10.12.1, on a MacBook Pro with a 2.8 GHz Intel Core i7 (2 cores 8 threads), 16GB RAM, and an Nvidia 750m with 2GB VRAM.
My application uses the Inception V3 model and pulls feature vectors from the pool_3 layer. I'm decoding video frames via native APIs, passing those in-memory buffers to the C++ interface for TF, and running them through a session.
I'm not currently batching, but I am caching my session and reusing it for each individual decoded frame / tensor submission. I've noticed that CPU and GPU performance are about the same, taking about 40 to 50 seconds to process 222 frames, which seems very slow to me. I've confirmed CUDA is being invoked and loaded, and the GPU is functioning (or appears to be).
Some questions:
In general, what should I expect for reasonable performance, time-wise, for TF running a frame through Inception on a consumer laptop?
How much of a difference does batching make for these operations? For tensors of 1x299x299x3, I imagine I am spending more time waiting on PCI transfers than waiting on meaningful work from the GPU?
If so, is there a good example of batching under C++ for InceptionV3?
Are there operations that cause additional CPU->GPU synchronization that might otherwise be avoided?
Is there a way to ensure my sessions / graphs share resources? Can I use nested scopes somehow in this manner? I couldn't quite get that to work, but I likely missed something.
Is there any good documentation of general strategies for things to do / avoid?
My code is below:
https://github.com/Synopsis/Synopsis/blob/TensorFlow/Synopsis/TensorFlowAnalyzer/TensorFlowAnalyzer.mm
Thank you very much
For reference, OpenCV analysis using perceptual hash, histogram, dense optical flow, sparse optical flow for point tracking, and simple saliency detection takes 4 to 5 seconds for the same 222 frames using CPU or CPU + OpenCL.
https://github.com/Synopsis/Synopsis/tree/TensorFlow/Synopsis/StandardAnalyzer

Answering your last question first (whether there's documentation about performance optimization): yes.
The TensorFlow Performance Guide
The TensorFlow GPU profiling hints
Laptop performance is highly variable, and TF isn't particularly optimized for laptop GPUs. The numbers you're getting (222 frames in 40-50 seconds, or roughly 5 fps) don't seem crazy for the 2016 version of TensorFlow running Inception on a laptop. With some of the improvements outlined in the performance guide above, that should probably be doubled as of late 2017.
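To make the "roughly 5 fps" figure concrete, here is the arithmetic as a small sketch. It is in Python purely for brevity (the question's code is C++), and the helper name frames_per_second is made up for illustration. Only the 222 frames and the 40-50 second range come from the question.

```python
FRAMES = 222  # frame count reported in the question


def frames_per_second(frames: int, seconds: float) -> float:
    """Average throughput over the whole run."""
    return frames / seconds


# Best and worst case of the reported 40-50 second range.
fps_fast = frames_per_second(FRAMES, 40)
fps_slow = frames_per_second(FRAMES, 50)

# The same numbers expressed as average latency per frame, in milliseconds.
ms_per_frame_fast = 1000 * 40 / FRAMES
ms_per_frame_slow = 1000 * 50 / FRAMES

print(f"{fps_slow:.2f} to {fps_fast:.2f} fps")
print(f"{ms_per_frame_fast:.0f} to {ms_per_frame_slow:.0f} ms per frame")
```

So each unbatched frame costs on the order of 180-225 ms end to end, which includes transfer and session overhead as well as the Inception forward pass itself.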
For batching, yes - the newer example inception model code allows a variable batch size at inference time. This is mostly about whether the model itself was defined to handle a batch size, which is something improved since 2016.
Batching for inference will make a pretty big difference on GPU. Whether it helps on CPU depends a lot -- for example, if you build with MKL-DNN support, batching should be considered mandatory, but basic TensorFlow may not benefit as much.
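As a rough illustration of what batching means on the caller's side, here is a minimal sketch of the grouping step only. It is written in Python for brevity rather than the asker's C++, it does not call TensorFlow, and the helper name batched and the batch size of 32 are assumptions chosen for the example. In a real pipeline each group would be stacked into a single Nx299x299x3 tensor and submitted in one session run instead of N separate runs.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(frames: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of up to batch_size consecutive frames."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for frame in frames:
        batch.append(frame)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # The last batch may be smaller than batch_size.
        yield batch


# Stand-in for 222 decoded frames; integers are used instead of pixel buffers.
batches = list(batched(range(222), 32))
batch_sizes = [len(b) for b in batches]
print(len(batches), batch_sizes[-1])
```

With 222 frames and a batch size of 32 this gives 7 submissions instead of 222, with a final partial batch of 30 frames. The model has to accept a variable batch dimension for that last batch to work.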

Related

Since TensorflowJS can use the GPU via WebGL, why would I need an nVIDIA GPU?

So TensorFlowJS can use WebGL to do GPU computations and train deep learning models. Why isn't this more popular than using CUDA with an nVIDIA GPU? Most people just trying to prototype machine learning models would love to do so on their personal computer, but many of us resort to using expensive cloud services like AWS (although more recently Google Colab helps) for ML training if we don't have a computer with an nVIDIA GPU. I'm sure nVIDIA GPUs are faster than whatever GPU is in my MacBook, but probably any GPU will offer at least an order of magnitude speedup over even a fast CPU and allow for model prototyping, so why aren't we all using WebGL GPGPU? There must be a catch I just don't know about.
The WebGL backend uses GLSL to define functions and uploads data as shaders. It "works", but you pay a huge cost to compile the GLSL and upload the shaders: warmup time for semi-complex models is immense (we're talking about minutes just to start up). On top of that, memory overhead is 100-200% of what the model would normally need, and for larger models you're GPU memory bound, so you don't want to waste that.
Btw, actual inference time once the model is warmed up and fits in memory is OK using WebGL.
On the other hand, nVidia's CUDA libraries provide direct access to the GPU, so TF compiled to use them is always going to be much more efficient.
Unfortunately, not many GPU vendors provide libraries like CUDA, so most ML is done on nVidia GPUs.
Then there is the next level, when you're using a TPU instead of a GPU: there is no WebGL there to start with.
If I select WebGPU with the TFJS benchmark (https://tensorflow.github.io/tfjs/e2e/benchmarks/local-benchmark/index.html) it responds with "WebGPU is not supported. Please use Chrome Canary browser with flag "--enable-unsafe-webgpu" enabled...."
So when that's ready will it be competitive with CUDA? On my laptop it is about 15% faster than WebGL on that benchmark.

Tensorflow Serving Performance Very Slow vs Direct Inference

I am running in the following scenario:
Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
1 Tensorflow Serving Pod
4 Inference Client Pods
What the inference clients do is get images from 4 separate cameras (1 each) and pass them to TF-Serving for inference, in order to understand what is seen on the video feeds.
I have previously been doing inference inside the Inference Client Pods individually by calling TensorFlow directly but that hasn't been good on the RAM of the graphics card. Tensorflow Serving has been introduced to the mix quite recently in order to optimize RAM as we don't load duplicated models to the graphics card.
And the performance is not looking good. For 1080p images it looks like this:
Direct TF: 20ms for input tensor creation, 70ms for inference.
TF-Serving: 80ms for gRPC serialization, 700-800ms for inference.
The TF-Serving pod is the only one that has access to the GPU and it is bound exclusively. Everything else operates on CPU.
Are there any performance tweaks I could do?
The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.
Many thanks in advance!
This is from TF Serving documentation:
Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.
From my own experience, I've found TF Serving useful in providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving, which come out of the box, save you lots of time and are great additions.
Additionally, I would recommend batching your requests if you haven't already. I would also suggest playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM, and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run some frames in parallel. Here are some performance benchmarks for very similar i7-8700T.
There is even OpenVINO Model Server which is very similar to Tensorflow Serving.
Disclaimer: I work on OpenVINO.

Why is the output video too slow in darknet?

I trained YOLOv2 on my own dataset in darknet. I am using Ubuntu 18.04 and have no GPU. When I play a video (which I took on my smartphone) for testing, it is too slow. Is it because I don't have a GPU, or is it for some other reason?
Can someone help?
Without a GPU, yolov2 is going to be very slow, and if you have a modern smartphone, it's likely that the video is high resolution with a high frame rate. I'm not sure of your implementation, but it's likely you're processing every frame in the video instead of skipping every other frame or only processing every 10th frame.
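The frame-skipping idea can be sketched as follows. This is a generic Python illustration, not darknet code: the helper every_nth and the 300-frame stand-in clip are hypothetical, and in a real pipeline the frames would come from a video decoder, with only the kept ones sent to the detector.

```python
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def every_nth(frames: Iterable[T], n: int) -> Iterator[Tuple[int, T]]:
    """Yield (index, frame) for every n-th frame, starting with frame 0."""
    if n < 1:
        raise ValueError("n must be at least 1")
    for index, frame in enumerate(frames):
        if index % n == 0:
            yield index, frame


# Stand-in for a 300-frame clip; integers are used instead of decoded images.
kept = [index for index, _ in every_nth(range(300), 10)]
print(len(kept), kept[:3])
```

Processing every 10th frame cuts the detector's workload by a factor of ten. The last detection can be reused for the skipped frames if the scene does not change quickly.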
If you don't have a GPU available (and aren't going to), another way to get GPU-type performance is Intel's OpenVINO, if you have a recent Intel Core i-series processor. You'd be able to convert your yolov2 model to OpenVINO and run it on a CPU with really fast inference times (likely <100ms per frame). I will say, though, that I ran yolov3 on OpenVINO and it was really slow compared to other object detectors, especially compared to a MobileNet.
I also have some demos set up to compare yolov3 on a CPU with OpenVINO on a CPU; you can check those out on SugarKubes.
One big reason is, of course, that you don't have a GPU. The other reason is the model you use. YoloV2 is faster than YoloV3 but still slower than TinyYolo or TinyYoloV3.
So this is a trade-off between accuracy and speed: the faster your model, the lower the accuracy. If you are going for speed, then there are 3 solutions that I can think of:
1. Use a GPU (I know it's expensive, but it's worth the price; an Nvidia GTX 1060 or better would be great).
2. Change your model to TinyYolo or TinyYoloV3. I recommend TinyYoloV3 for higher fps:
   TinyYoloV3: 220 fps
   TinyYolo: 207 fps
   YoloV2: 67 fps
3. Use OpenVINO, as Andrew Pierno said.
Download model from here : https://pjreddie.com/darknet/yolo/
Yolov2's link : https://pjreddie.com/darknet/yolov2/

What's the impact of using a GPU in the performance of serving a TensorFlow model?

I trained a neural network using a GPU (1080 ti). The training speed on GPU is far better than using CPU.
Currently, I want to serve this model using TensorFlow Serving. I'm just interested to know whether using a GPU in the serving process has the same impact on performance.
Since training operates on batches but inference (serving) handles asynchronous requests, do you suggest using a GPU when serving a model with TensorFlow Serving?
You still need to do a lot of tensor operations on the graph to predict something. So GPU still provides performance improvement for inference. Take a look at this nvidia paper, they have not tested their stuff on TF, but it is still relevant:
Our results show that GPUs provide state-of-the-art inference
performance and energy efficiency, making them the platform of choice
for anyone wanting to deploy a trained neural network in the field. In
particular, the Titan X delivers between 5.3 and 6.7 times higher
performance than the 16-core Xeon E5 CPU while achieving 3.6 to 4.4
times higher energy efficiency.
The short answer is yes: you'll get roughly the same speedup from running on the GPU after training, with a few minor qualifications.
In training you run two passes over the data (forward and backward), all on the GPU. During feedforward inference you do less work, so more time is spent transferring data to GPU memory relative to computation than in training. This is probably a minor difference, though, and you can now load the GPU asynchronously if that's an issue (https://github.com/tensorflow/tensorflow/issues/7679).
Whether you actually need a GPU for inference depends on your workload. If it isn't overly demanding, you might get away with using the CPU; after all, the computational workload per sample is less than half that of training. Consider the number of requests per second you'll need to serve, and test whether you overload your CPU to achieve that. If you do, time to get the GPU out!
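A back-of-the-envelope version of that check might look like the sketch below. The helper workers_needed and the example numbers (50 ms per request, 10 and 30 requests per second) are hypothetical and only illustrate the reasoning. The per-request latency has to be measured on your own CPU, and this average-load estimate ignores bursts and queueing.

```python
import math


def workers_needed(requests_per_second: float, seconds_per_request: float) -> int:
    """Serial workers needed to keep up with the average request rate.

    requests_per_second * seconds_per_request is the offered load:
    the number of requests in flight on average.
    """
    if requests_per_second < 0 or seconds_per_request < 0:
        raise ValueError("inputs must be non-negative")
    return math.ceil(requests_per_second * seconds_per_request)


# Hypothetical measurement: one CPU inference takes 50 ms.
light_load = workers_needed(requests_per_second=10, seconds_per_request=0.05)
heavy_load = workers_needed(requests_per_second=30, seconds_per_request=0.05)
print(light_load, heavy_load)
```

If the result exceeds the number of CPU workers you can realistically dedicate to inference, that is the point at which a GPU starts to pay off.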

Why do we need GPU for Deep Learning?

As the question already suggests, I am new to deep learning. I know that training the model will be slow without a GPU. If I am willing to wait, will it be OK if I use the CPU only?
Many operations which are performed in computing deep learning (and neural networks in general) can be run in parallel, meaning they can be calculated independently then aggregated later. This is, in part, because most of the operations are on vectors.
A typical consumer CPU has between 4 and 8 cores, and hyperthreading allows them to be treated as 8 or 16 respectively. Server CPUs can have between 4 and 24 cores, or 8 to 48 threads respectively. Additionally, most modern CPUs have SIMD (single instruction multiple data) extensions which allow them to perform vector operations in parallel on a single thread. Depending on the data type you're working with, an 8 core CPU can perform 8 * 2 * 4 = 64 to 8 * 2 * 8 = 128 vector calculations at once.
Nvidia's new 1080ti has 3584 CUDA cores, which essentially means it can perform 3584 vector calculations at once (hyperthreading and SIMD don't come into play here). That's 28 to 56 times more operations at once than an 8 core CPU. So, whether you're training a single network, or several to tune meta-parameters, it will probably be significantly faster on a GPU than on a CPU.
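The arithmetic behind those ratios can be written out as a small sketch. It reproduces the answer's simplified counting model (cores x hardware threads x SIMD lanes) and is not a measurement of real throughput. The variable names are made up for the example.

```python
CPU_CORES = 8
THREADS_PER_CORE = 2        # hyperthreading
SIMD_LANES = (4, 8)         # lanes per thread, depending on data type
GPU_CUDA_CORES = 3584       # 1080 Ti, as quoted in the answer

# Parallel vector calculations on the CPU under this counting model.
cpu_low = CPU_CORES * THREADS_PER_CORE * SIMD_LANES[0]
cpu_high = CPU_CORES * THREADS_PER_CORE * SIMD_LANES[1]

# How many times more operations the GPU performs at once.
ratio_vs_narrow_simd = GPU_CUDA_CORES / cpu_low
ratio_vs_wide_simd = GPU_CUDA_CORES / cpu_high

print(cpu_low, cpu_high)
print(ratio_vs_wide_simd, ratio_vs_narrow_simd)
```

The wider the CPU's SIMD lanes for a given data type, the smaller the GPU's advantage under this model, which is why the ratio is quoted as a range.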
Depending on what you are doing, it might take a lot longer. I have had 20x speedups by using a GPU. If you read some Computer Vision papers, they train their networks on ImageNet for about 1-2 weeks. Now imagine if that took 20x longer...
Having said that: There are much simpler tasks. For example, for my HASY dataset you can train a reasonable network without a GPU in probably 3 hours. Similar small datasets are MNIST, CIFAR-10, CIFAR-100.
The computationally intensive part of a neural network is its many matrix multiplications. How do we make that faster? By doing the operations at the same time instead of one after the other. This is, in a nutshell, why we use a GPU (graphics processing unit) instead of a CPU (central processing unit).
Google used to have a powerful system, which they had specially built for training huge nets. This system cost $5 billion, with multiple clusters of CPUs.
A few years later, researchers at Stanford built a system with the same computational power to train their deep nets using GPUs. They reduced the cost to $33K.
Source: https://www.analyticsvidhya.com/blog/2017/05/gpus-necessary-for-deep-learning/
Deep learning is about building a mathematical model of reality, or of some part of it, for a specific use. You train the model on a lot of data collected from the real world so that it can predict outcomes when given new data as input. Both the amount of data and the training itself demand a lot of computation. That is why companies such as Nvidia, which have traditionally made GPUs for gaming, now earn a large part of their revenue from AI and machine learning, and why companies like Google and Facebook use GPUs to train their ML models.
If you ask this question, you probably need a GPU/TPU (Tensor Processing Unit).
You can get one for "free" with Google Colab, which offers cloud GPUs.
You can start working with your Google account
in a Jupyter notebook: https://colab.research.google.com/notebooks/intro.ipynb
Kaggle (a Google-owned data science competition site) also lets you create Jupyter notebooks with a GPU, but only in limited cases:
notebooks: https://www.kaggle.com/kernels
Documentation for it: https://www.kaggle.com/docs/notebooks