How does TF Serving handle GPU concurrency? - tensorflow

I'm using TF Serving to serve multiple models deployed on a single A30 GPU. I want to analyze its scalability, so I want to understand how TF Serving works with an A30 GPU. If there are multiple requests for different models with different inputs, how does TF Serving handle that situation?
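For concreteness, here is a minimal Python gRPC client sketch of that scenario: one TF Serving instance hosting several models on the GPU, with concurrent requests going to different models. The model names (model_a, model_b), input key (input_1), shapes, and port are placeholder assumptions, not taken from the question.

```python
import grpc
import numpy as np
import tensorflow as tf
from concurrent.futures import ThreadPoolExecutor
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")   # TF Serving's default gRPC port
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def predict(model_name, input_key, batch):
    # Build a Predict RPC for one model; the signature and input key are model-specific.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    request.inputs[input_key].CopyFrom(tf.make_tensor_proto(batch))
    return stub.Predict(request, timeout=5.0)

# Two requests for two different models issued concurrently; both models live in the
# same TF Serving process and therefore share the same GPU.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_a = pool.submit(predict, "model_a", "input_1",
                        np.random.rand(1, 224, 224, 3).astype(np.float32))
    fut_b = pool.submit(predict, "model_b", "input_1",
                        np.random.rand(1, 128).astype(np.float32))
    resp_a, resp_b = fut_a.result(), fut_b.result()
```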

Related

Serving hundreds of models with Tensorflow serving

I would like to serve roughly 600 models with TensorFlow Serving.
I am trying to find a solution to eventually reduce the number of models:
My models have the same architecture, only the weights change.
Is it possible to load only one model and change the weights?
Would it be possible to aggregate all those models together, so that effectively the first layer of the model would take an ID and the input features for that model?
Has anyone tried running a couple hundred models on one machine? I have found this Cortex solution, but wanted to avoid using another tech.
https://towardsdatascience.com/how-to-deploy-1-000-models-on-one-cpu-with-tensorflow-serving-ec4297bff54b
If the models have the same architecture but different weights, you can try merging all those models into a "super model". However, I would need to know more about the task to see if that's possible.
To serve 600 models, you would need a very powerful machine and a lot of memory (depending on how big your models are and how much you use them in parallel).
You can either run TF Serving yourself, or use a provider such as Inferrd.com/Google/AWS.
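As a rough illustration of the "super model" idea, here is a minimal sketch. It assumes each original model is a single dense layer over D features, which is a stand-in for the real architecture; a real merge would apply the same stacking to every layer, with the per-model weights loaded from the 600 checkpoints.

```python
import numpy as np
import tensorflow as tf

N, D = 600, 16  # number of models and feature dimension (assumed values)
kernels = np.random.rand(N, D, 1).astype(np.float32)  # placeholder for the 600 models' weights
biases = np.random.rand(N, 1).astype(np.float32)

class SuperModel(tf.Module):
    def __init__(self, kernels, biases):
        super().__init__()
        self.kernels = tf.Variable(kernels, trainable=False)   # [N, D, 1]
        self.biases = tf.Variable(biases, trainable=False)     # [N, 1]

    @tf.function(input_signature=[tf.TensorSpec([None], tf.int32),
                                  tf.TensorSpec([None, D], tf.float32)])
    def serve(self, model_id, features):
        # Select each example's weights by model ID, then apply the dense layer.
        k = tf.gather(self.kernels, model_id)                    # [B, D, 1]
        b = tf.gather(self.biases, model_id)                     # [B, 1]
        y = tf.squeeze(tf.matmul(features[:, None, :], k), 1) + b
        return {"prediction": y}

module = SuperModel(kernels, biases)
tf.saved_model.save(module, "/tmp/super_model",
                    signatures={"serving_default": module.serve})
```

A single TF Serving instance would then load just this one SavedModel, and each request carries the model ID alongside the features.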

Tensorflow Serving Performance Very Slow vs Direct Inference

I am running in the following scenario:
Single Node Kubernetes Cluster (1x i7-8700K, 1x RTX 2070, 32GB RAM)
1 Tensorflow Serving Pod
4 Inference Client Pods
The inference clients each grab images from one of 4 separate cameras (one camera per client) and pass them to TF Serving for inference, in order to understand what is seen on the video feeds.
I had previously been doing inference inside the Inference Client Pods individually by calling TensorFlow directly, but that wasn't good for the graphics card's RAM. TensorFlow Serving was introduced to the mix quite recently to optimize RAM usage, since we no longer load duplicate models onto the graphics card.
And the performance is not looking good; for 1080p images it looks like this:
Direct TF: 20ms for input tensor creation, 70ms for inference.
TF-Serving: 80ms for gRPC serialization, 700-800ms for inference.
The TF-Serving pod is the only one that has access to the GPU and it is bound exclusively. Everything else operates on CPU.
Are there any performance tweaks I could do?
The model I'm running is Faster R-CNN Inception V2 from the TF Model Zoo.
Many thanks in advance!
This is from the TF Serving documentation:
Please note, while the average latency of performing inference with TensorFlow Serving is usually not lower than using TensorFlow directly, where TensorFlow Serving shines is keeping the tail latency down for many clients querying many different models, all while efficiently utilizing the underlying hardware to maximize throughput.
From my own experience, I've found TF Serving useful in providing a consistent abstraction over model serving that does not require implementing custom serving functionality. Model versioning and multi-model serving, which come out of the box, save you a lot of time and are great additions.
Additionally, I would recommend batching your requests if you haven't already. I would also suggest playing around with the TENSORFLOW_INTER_OP_PARALLELISM, TENSORFLOW_INTRA_OP_PARALLELISM, and OMP_NUM_THREADS arguments to TF Serving. Here is an explanation of what they are.
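As a sketch of client-side batching, assuming the four camera frames can tolerate being sent together in one request (the model name faster_rcnn and input key inputs are placeholders; check the SavedModel signature for the real names):

```python
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")   # TF Serving's default gRPC port
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# One dummy 1080p frame per camera; in practice these come from the 4 client feeds.
frames = [np.zeros((1080, 1920, 3), dtype=np.uint8) for _ in range(4)]

request = predict_pb2.PredictRequest()
request.model_spec.name = "faster_rcnn"               # assumed model name
request.model_spec.signature_name = "serving_default"
# Stack to a single [4, 1080, 1920, 3] tensor so one RPC replaces four separate ones.
request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(np.stack(frames)))
response = stub.Predict(request, timeout=10.0)
```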
Maybe you could try OpenVINO? It's a heavily optimized toolkit for inference. You could utilize your i7-8700K and run some frames in parallel. Here are some performance benchmarks for the very similar i7-8700T.
There is even OpenVINO Model Server which is very similar to Tensorflow Serving.
Disclaimer: I work on OpenVINO.

How to do tensorflow inference with multiple models?

Suppose I have 2 tf.keras models that work with images, and I want to use these models inside my standalone C++ program, which reads images from 2 cameras continuously and runs a model on each frame. Each camera has a different frame rate, so both models must run in parallel, and they need to utilize the GPU as well.
Each model predicts different things.
The C++ app runs on only one desktop PC; it does not need to serve predictions to many people, so TensorFlow Serving might be overkill, or not fast enough? I suspect it is not fast because I saw that it uses JSON to encode predictions, which might be slow to send and receive.
I have saved both models in the SavedModel format, and they can be used for prediction just fine in Python.
How can I load these 2 models and predict in parallel, utilizing one GPU, as fast as possible in a C++ app? I only care about the case where the prediction is for a single image, or 2 images, not a batch like 32 images. If there is a way to make TensorFlow Serving or TensorFlow Lite work fast, I am also fine with that.
Bonus: the method should not require a difficult installation on the user's side. Consider that in the end users will have to install my app; should they also have to manually install TensorFlow dependencies? That's impractical.
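For what it's worth, here is a minimal Python sketch of the pattern described in the question: each model loaded once, each camera loop on its own thread, both threads sharing the single GPU. The SavedModel paths and input key are hypothetical, and the same structure would carry over to the C++ SavedModel API.

```python
import threading
import tensorflow as tf

model_a = tf.saved_model.load("/models/model_a")   # hypothetical path
model_b = tf.saved_model.load("/models/model_b")   # hypothetical path
infer_a = model_a.signatures["serving_default"]
infer_b = model_b.signatures["serving_default"]

def camera_loop(infer, input_key, height, width, n_frames=100):
    # Stand-in for a real camera read loop; each camera can run at its own frame rate.
    for _ in range(n_frames):
        frame = tf.zeros([1, height, width, 3], dtype=tf.float32)  # dummy frame
        _ = infer(**{input_key: frame})   # both threads share the one GPU

# The real input key can be inspected via infer_a.structured_input_signature.
t_a = threading.Thread(target=camera_loop, args=(infer_a, "input_1", 480, 640))
t_b = threading.Thread(target=camera_loop, args=(infer_b, "input_1", 720, 1280))
t_a.start(); t_b.start()
t_a.join(); t_b.join()
```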

Tensorflow serving SharedBatchScheduler for golang

I want to optimize serving a TensorFlow model in Golang, and I am studying TensorFlow Serving.
servers-with-multiple-models-model-versions-or-subtasks
SharedBatchScheduler presents an abstraction of queues, accepts requests to schedule a particular kind of task. Each batch contains tasks of just one type, i.e. from one queue. The scheduler ensures fairness by interleaving the different types of batches.
Is there any Golang repo working on a SharedBatchScheduler, or do I have to rewrite the TensorFlow Serving BatchingSession C++ source code into a Golang version?
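As a rough illustration of the abstraction described in that quote, here is a minimal Python sketch (not the TF Serving C++ implementation): one queue per model/task type, each batch drawn from a single queue, and the queues serviced round-robin so the different batch types are interleaved. Queue names and limits are placeholders.

```python
import queue
import time
from itertools import cycle

MAX_BATCH_SIZE = 8       # assumed limit per batch
BATCH_TIMEOUT_S = 0.01   # assumed wait before dispatching a partial batch

# One queue per task type (hypothetical model names).
queues = {"model_a": queue.Queue(), "model_b": queue.Queue()}

def submit(model_name, task):
    queues[model_name].put(task)

def scheduler_loop(run_batch):
    for name in cycle(queues):                           # interleave queues for fairness
        batch = []
        deadline = time.time() + BATCH_TIMEOUT_S
        while len(batch) < MAX_BATCH_SIZE and time.time() < deadline:
            try:
                batch.append(queues[name].get(timeout=BATCH_TIMEOUT_S))
            except queue.Empty:
                break
        if batch:                                        # each batch holds tasks of one type only
            run_batch(name, batch)
```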

Serving multiple deep learning models from cluster

I was thinking about how one should deploy multiple models for use. I am currently working with TensorFlow. I was referring to this article and this one.
But I am not able to find any article that targets the need to serve several models in a distributed manner. Q.1. Does TensorFlow Serving only serve models from a single machine? Is there any way to set up a cluster of machines running TensorFlow Serving, so that multiple machines serve the same model (working somewhat as master and slave, or load-balancing between them) while serving different models?
Q.2. Does similar functionality exist for other deep learning frameworks, say Keras, MXNet, etc. (i.e. not restricting to TensorFlow, and serving models from different frameworks)?
A1: Serving TensorFlow models in a distributed fashion is made easy with Kubernetes, a container orchestration system that takes away much of the pain of running a distributed system, including load balancing. Please check Serving with Kubernetes.
A2: Sure, check for instance PredictionIO. It's not deep learning specific, but it can be used to deploy models made with e.g. Spark MLlib.