Serving multiple deep learning models from a cluster - tensorflow

I was thinking about how one should deploy multiple models for serving. I am currently working with TensorFlow, and I was referring to this and this article.
But I am not able to find any article that addresses serving several models in a distributed manner.
Q.1. Does TensorFlow Serving only serve models from a single machine? Is there a way to set up a cluster of machines running TensorFlow Serving, so that multiple machines serve the same model (working somewhat as master and slaves, or load-balancing between them) while also serving different models?
Q.2. Does similar functionality exist for other deep learning frameworks, say Keras, MXNet, etc. (i.e., not restricted to TensorFlow, and serving models from different frameworks)?

A1: Serving TensorFlow models in a distributed fashion is made easy with Kubernetes, a container orchestration system that takes much of the pain of running a distributed system away from you, including load balancing. Please check serving kubernetes.
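To make that concrete, here is a minimal client sketch (not from the original answer) that calls a TF Serving deployment exposed behind a Kubernetes Service or load balancer over its REST API; the host, port, and model name are placeholders for your own setup:

```python
# Minimal client sketch: calling TensorFlow Serving behind a Kubernetes
# Service / load balancer over its REST API (default port 8501).
# Host and model name are hypothetical placeholders.
import json
import requests

SERVING_HOST = "tf-serving.example.com"  # hypothetical Service/Ingress address
MODEL_NAME = "my_model"                  # hypothetical model name

def predict(instances):
    """Send a batch of inputs to TF Serving's REST predict endpoint."""
    url = f"http://{SERVING_HOST}:8501/v1/models/{MODEL_NAME}:predict"
    response = requests.post(url, data=json.dumps({"instances": instances}))
    response.raise_for_status()
    return response.json()["predictions"]

if __name__ == "__main__":
    # The instance shape must match the model's serving signature.
    print(predict([[1.0, 2.0, 5.0]]))
```

Kubernetes takes care of spreading replicas of the serving container across machines and balancing requests between them; the client only ever sees the single Service address.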
A2: Sure, check for instance PredictionIO. It's not specific to deep learning, but it can be used to deploy models made with e.g. Spark MLlib.

Related

What's the difference between TFServing and KFServing on Kubeflow

TFServing and KFServing both deploy models on Kubeflow and let users easily consume a model as a service, without needing to know the details of Kubernetes; they hide the infrastructure layers.
TFServing is from TensorFlow; it can run on Kubeflow or standalone. TFServing on Kubeflow
KFServing is from Kubeflow and can support multiple frameworks like PyTorch, TensorFlow, MXNet, etc. KFServing
My question is: what's the main difference between these two projects?
If I want to launch my model in production, which should I use? Which has better performance?
KFServing is an abstraction on top of inferencing rather than a replacement. It seeks to simplify deployment and make inferencing clients agnostic to what inference server is doing the actual work behind the scenes (be it TF Serving, Triton (formerly TRT-IS), Seldon, etc). It does this by seeking agreement among inference server vendors on an inferencing dataplane specification which allows extra components (such as transformations and explainers) to be more pluggable.
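To illustrate the "client-agnostic" point, here is a small sketch of what a client of the v1 inference data plane can look like; the ingress address, InferenceService hostname, and model name are hypothetical, and the Host-header routing assumes an Istio ingress in front of KFServing:

```python
# Sketch of the client-side view of the v1 inference data plane:
# the same request shape works whether the model is backed by TF Serving,
# Triton, or another server behind KFServing. All hosts/names are placeholders.
import requests

INGRESS = "http://istio-ingressgateway.example.com"   # hypothetical ingress address
SERVICE_HOST = "my-model.default.example.com"         # hypothetical InferenceService host
MODEL = "my-model"                                     # hypothetical model name

headers = {"Host": SERVICE_HOST}  # KFServing typically routes by Host header via Istio

# Readiness / model status check (v1 data plane)
status = requests.get(f"{INGRESS}/v1/models/{MODEL}", headers=headers)
print(status.json())

# Prediction request: the same payload shape as TF Serving's REST API
resp = requests.post(
    f"{INGRESS}/v1/models/{MODEL}:predict",
    headers=headers,
    json={"instances": [[1.0, 2.0, 5.0]]},
)
print(resp.json())
```

Because the data plane is standardised, the client above does not change if the underlying inference server is swapped out.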

TensorFlow Serving Cluster Architecture

Folks, I am writing an application which will produce recommendations based on ML model calls. The application will have different models, some of which should be called in sequence. A data scientist should be able to upload a model into the system. This means that the application should have logic to store model metadata as well as the address of a model server. A model server will be instantiated dynamically on a model upload event.
I would like to use a cluster of TensorFlow Serving instances here; however, I am stuck on a question of architecture.
Is there a way to have something like a service registry for TensorFlow servers? What is the best way to build such a cluster of servers hosting different models?
I need some clarification on what you're trying to do. Is the feature vector the same for all the models? If not, then it will be quite a bit harder to do this. Trained models are encapsulated in the SavedModel format. It sounds like you're trying to train an ensemble, but some of the models are frozen? You could certainly write a custom component to make an inference request as part of the input to Trainer, if that's what you need.
UPDATE 1
From your comment below it sounds like what you might be looking for is a service mesh, such as Istio. That would help manage the connections between services running inside containers, and the connections between users and services. In this case the tf.Serving instances running your models are the services, but the basic request-response pattern is the same. Does that help?
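If that is the direction, a rough illustration of the registry idea (independent of any particular service mesh, and not part of the original answer) could look like the following; every name and endpoint here is a placeholder:

```python
# Hypothetical sketch of an application-level "model registry": a mapping
# from model name to the TF Serving instance that hosts it. In practice,
# Kubernetes Services or a service mesh (e.g. Istio) would handle the
# addressing and routing; this only shows the bookkeeping idea.
import requests

class ModelRegistry:
    def __init__(self):
        self._endpoints = {}  # model name -> base URL of its TF Serving instance

    def register(self, model_name, base_url):
        """Called when a model server is spun up on a model upload event."""
        self._endpoints[model_name] = base_url

    def predict(self, model_name, instances):
        base_url = self._endpoints[model_name]
        resp = requests.post(
            f"{base_url}/v1/models/{model_name}:predict",
            json={"instances": instances},
        )
        resp.raise_for_status()
        return resp.json()["predictions"]

registry = ModelRegistry()
registry.register("recommender_v1", "http://tf-serving-recommender:8501")  # placeholder
# predictions = registry.predict("recommender_v1", [[0.1, 0.2, 0.3]])
```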

Is it possible to share a Tensorflow model among Gunicorn workers?

I need to put a TensorFlow model into production behind a simple API endpoint. The model should be shared among processes/workers/threads in order to not waste too many resources in terms of memory.
I already tried running multiple Gunicorn workers with the --preload option, loading the model before the definition of the resources, but after a while I get a timeout error; the model is not responding. I know that TensorFlow Serving is available for this purpose, but the problem is that the model is not yet available at deployment time, and the ML pipeline is composed of many different components (the TensorFlow model is just one of these components). The end user (customer) is the one that trains/saves the model. (I am using Docker.) Thanks in advance.
The issue is that the TensorFlow runtime and its global state are not fork safe.
See this thread: https://github.com/tensorflow/tensorflow/issues/5448 or this one: https://github.com/tensorflow/tensorflow/issues/51832
You could attempt to load the model only in the workers using the post_fork and post_worker_init server hooks.
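As a sketch of that suggestion (assuming a Gunicorn gunicorn.conf.py, the TF 2 SavedModel API, and a hypothetical module that the application imports the model from):

```python
# gunicorn.conf.py -- sketch: load the model inside each worker rather than
# in the master process, because the TensorFlow runtime/global state is not
# fork safe. The model path and `myapp.model_store` module are hypothetical.
import tensorflow as tf

MODEL_PATH = "/models/my_model"  # hypothetical SavedModel directory

def post_worker_init(worker):
    # Runs in each worker process after it has been forked and initialised,
    # so every worker builds its own TensorFlow runtime instead of
    # inheriting forked global state from the master.
    import myapp.model_store as store  # hypothetical module shared with the app
    store.model = tf.saved_model.load(MODEL_PATH)
```

The trade-off is that each worker holds its own copy of the model in memory, but the workers no longer hang on a runtime that was broken by the fork.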

Dask, Tensorflow serving (and Kubernetes and Streamz)

What is the current 'state of technology' for a pipeline composed of Python code and TensorFlow/Keras models?
We are aiming for scalability and a reactive design using Dask and Streamz (for servers registered using Kubernetes).
But currently we do not know the right way to design such infrastructure, given that we want our TensorFlow models to persist and not be repeatedly created and deleted.
Is TensorFlow Serving the technology to use for this task?
(I was able to find only basic examples like Persistent dataflows with dask and http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow)
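One possible pattern, sketched below under the assumption that a long-running TF Serving instance is reachable from the Dask workers (the host and model name are placeholders), is to keep the model resident in TF Serving and let Dask tasks act as lightweight clients:

```python
# Sketch: the model persists inside a long-running TF Serving deployment,
# and Dask tasks only make HTTP calls, so nothing is repeatedly created and
# deleted inside the workers. URL and model name are hypothetical.
import requests
from dask.distributed import Client

SERVING_URL = "http://tf-serving:8501/v1/models/my_model:predict"  # placeholder

def score(batch):
    # Each Dask task is just a client call; the model stays resident in
    # TF Serving for the lifetime of that deployment.
    resp = requests.post(SERVING_URL, json={"instances": batch})
    resp.raise_for_status()
    return resp.json()["predictions"]

if __name__ == "__main__":
    client = Client()  # or connect to a Kubernetes-backed Dask cluster
    futures = client.map(score, [[[1.0, 2.0]], [[3.0, 4.0]]])
    print(client.gather(futures))
```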

distributed tensorflow clarification

Is my understanding correct that model_deploy lets the user train a model using multiple devices on a single machine? The basic premise seems to be that the clone devices share variables, and variables get distributed to parameter servers in a round-robin fashion.
On the other hand, the distributed TensorFlow framework enables the user to train a model through a cluster. A cluster lets the user train a model using multiple devices across multiple servers.
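For reference, a minimal sketch of the cluster concept in the distributed TensorFlow (TF 1.x style) API; the hostnames below are placeholders:

```python
# Sketch of a distributed TensorFlow (TF 1.x style) cluster: a ClusterSpec
# names parameter-server and worker tasks across machines, whereas
# model_deploy's clones live on devices of a single machine.
import tensorflow as tf  # TF 1.x style API

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],            # parameter server task(s)
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],        # training worker tasks
})

# Each process starts a server for its own task; variables get placed on the
# ps tasks (round-robin by default) and compute ops on the worker tasks.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```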
I think the Slim documentation is very slim, and this point has been raised a couple of times already: Configuration/Flags for TF-Slim across multiple GPU/Machines
Thank you.