TensorFlow Serving Cluster Architecture

Folks, I am writing an application that will produce recommendations based on ML model calls. The application will use several models, some of which have to be called in sequence. A data scientist should be able to upload a model into the system. This means the application needs logic to store model metadata as well as the address of a model server. A model server will be instantiated dynamically on a model-upload event.
I would like to use a cluster of TensorFlow Serving instances here, however I am stuck on a question of architecture.
Is there a way to have something like a service registry for TensorFlow Serving servers? What is the best way to build such a cluster of servers hosting different models?

I need some clarification on what you're trying to do. Is the feature vector the same for all the models? If not, then it will be quite a bit harder to do this. Trained models are encapsulated in the SavedModel format. It sounds like you're trying to train an ensemble, but some of the models are frozen? You could certainly write a custom component to make an inference request as part of the input to Trainer, if that's what you need.
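For reference, a minimal sketch of the SavedModel export/reload round trip, using a toy tf.Module as a stand-in model; the path and the input values are just placeholders:

```python
import tensorflow as tf

# Toy stand-in model; a real trained model would be exported the same way.
class Scorer(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None, 3], tf.float32)])
    def score(self, x):
        return {"score": tf.reduce_sum(x, axis=1, keepdims=True)}

scorer = Scorer()
# Export in the SavedModel format that tf.Serving (and TFX components) consume.
tf.saved_model.save(scorer, "/tmp/demo_model/1",
                    signatures={"serving_default": scorer.score})

# Reload it and call the serving signature, e.g. to feed its output into another model.
loaded = tf.saved_model.load("/tmp/demo_model/1")
infer = loaded.signatures["serving_default"]
print(infer(x=tf.constant([[0.1, 0.2, 0.3]])))  # keyword matches the signature input
```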
UPDATE 1
From your comment below it sounds like what you might be looking for is a service mesh, such as Istio. That would help manage the connections between services running inside containers, and the connections between users and services. In this case, the tf.Serving instances running your models are the services, but the basic request-response pattern is the same. Does that help?
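If a full mesh is more than you need, a minimal in-application registry can cover the "service registry" part: keep a mapping from model name to the address of its tf.Serving instance, update it on model-upload events, and forward predict calls over TF Serving's REST API. This is only a sketch under those assumptions; the class, names and addresses are hypothetical.

```python
import requests

class ModelRegistry:
    def __init__(self):
        self._endpoints = {}  # model name -> base URL of its TF Serving instance

    def register(self, model_name, base_url):
        """Called from the model-upload event handler once the server is up."""
        self._endpoints[model_name] = base_url

    def predict(self, model_name, instances):
        base_url = self._endpoints[model_name]
        # TensorFlow Serving's REST predict API.
        resp = requests.post(
            f"{base_url}/v1/models/{model_name}:predict",
            json={"instances": instances},
            timeout=5,
        )
        resp.raise_for_status()
        return resp.json()["predictions"]

registry = ModelRegistry()
registry.register("ranker", "http://ranker-serving:8501")   # hypothetical address
# Models that must run in sequence can simply chain predict() calls:
# candidates = registry.predict("retriever", [[0.1, 0.2, 0.3]])
# scores = registry.predict("ranker", candidates)
```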

Related

ML model serving with great developer ergonomics

We are looking for ML model serving with a developer experience where the ML engineers don't need to know DevOps.
Ideally we are looking for the following ergonomics, or something similar:
1. Initialize a new model-serving endpoint, preferably via a CLI, and get back a GCS bucket.
2. Each time we train a new model, we put it in the GCS bucket from step 1.
3. The serving system guarantees that the most recent model in the bucket is served, unless a model is specified by version number.
We are also looking for a service that optimizes cost and latency.
Any suggestions?
Have you considered https://www.tensorflow.org/tfx/serving/architecture? You can definitely automate the entire workflow using TFX. I think the guide here does a good job of walking through it. Depending on your use case, you may want to use tft instead of Kubeflow, as they do in that guide.
Besides serving automation, you may also want to consider pipeline automation to separate the feature engineering from the pipeline mechanics itself. For example, you can build the pipeline, abstract the feature engineering out into a TensorFlow function meeting certain requirements, and automate the deployment process as well. This way you don't need to deal with the feature specs/schemas manually, and you know that your transformations are the same at serving time as they were during training.
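As a small illustration of that last point, the feature engineering is expressed once as a preprocessing_fn, which Transform applies during training and also attaches to the exported serving graph; the feature names here are made up:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Runs in the TFX Transform component; the same transformation graph is
    # attached to the exported model, so serving matches training exactly.
    return {
        "amount_scaled": tft.scale_to_z_score(inputs["amount"]),
        "category_id": tft.compute_and_apply_vocabulary(inputs["category"]),
    }
```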
You can do the same thing with scikit-learn as well, and I believe serving scikit-learn models is also supported under the Vertex AI umbrella.
To your point about latency, you definitely want the pipeline doing the transformations on the GPU; for that reason, I would recommend TensorFlow over something like scikit-learn if the use case is truly time-sensitive.
Best of luck!
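On the "serve the most recent model unless a version is pinned" requirement: that is TF Serving's default version policy for the base path it watches (which can be a GCS bucket), and a specific version can be requested explicitly through the REST API. A rough sketch, with host and model name as placeholders:

```python
import requests

payload = {"instances": [[0.1, 0.2, 0.3]]}

# Latest version found under the model's base path (TF Serving's default policy).
r = requests.post("http://serving-host:8501/v1/models/recommender:predict", json=payload)

# Pin an explicit version number instead.
r = requests.post(
    "http://serving-host:8501/v1/models/recommender/versions/3:predict", json=payload
)
print(r.json()["predictions"])
```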

Serving hundreds of models with Tensorflow serving

I would like to serve roughly 600 models with TensorFlow Serving.
I am trying to find a solution to eventually reduce the number of models:
My models have the same architecture; only the weights change.
Is it possible to load only one model and change the weights?
Would it be possible to aggregate all those models together so that, effectively, the first layer of the model takes a model ID together with the input features for that model?
Has anyone tried running a couple of hundred models on one machine? I found this Cortex solution, but wanted to avoid using another tech.
https://towardsdatascience.com/how-to-deploy-1-000-models-on-one-cpu-with-tensorflow-serving-ec4297bff54b
If the models have the same architecture but different weights, you can try merging all of those models into a "super model". However, I would need to know more about the task to see whether that is possible.
To serve 600 models, you would need a very powerful machine and a lot of memory (depending on how big your models are and how much you use them in parallel).
You can either run TF Serving yourself, or use a provider such as Inferrd.com/Google/AWS.
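If you do keep them as separate models, note that a single TF Serving instance can host many of them via a model config file passed with --model_config_file. A rough sketch that generates such a config for 600 models (names and base paths are placeholders):

```python
# Generate a models.config text proto listing all models for one TF Serving instance,
# started e.g. with: tensorflow_model_server --model_config_file=models.config
NUM_MODELS = 600

entries = []
for i in range(NUM_MODELS):
    entries.append(
        "  config {\n"
        f'    name: "model_{i}"\n'
        f'    base_path: "/models/model_{i}"\n'
        '    model_platform: "tensorflow"\n'
        "  }"
    )

with open("models.config", "w") as f:
    f.write("model_config_list {\n" + "\n".join(entries) + "\n}\n")
```

Every listed model is loaded into memory, so the point above about RAM still applies.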

Is it possible to share a Tensorflow model among Gunicorn workers?

I need to put a TensorFlow model into production behind a simple API endpoint. The model should be shared among processes/workers/threads in order not to waste too much memory.
I already tried multiple Gunicorn workers with the --preload option, loading the model before the resources are defined, but after a while I get a timeout error: the model is not responding. I know that TensorFlow Serving is available for this purpose, but the problem is that the model is not yet available at deployment time and the ML pipeline is composed of many different components (the TensorFlow model is just one of these components). The end user (customer) is the one who trains/saves the model. (I am using Docker.) Thanks in advance.
The issue is that the TensorFlow runtime and its global state are not fork safe.
See this thread: https://github.com/tensorflow/tensorflow/issues/5448 or this one: https://github.com/tensorflow/tensorflow/issues/51832
You could attempt to load the model only in the workers using the post_fork and post_worker_init server hooks.
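A minimal sketch of that suggestion as a Gunicorn config file; myapp.model_store (and the myapp.api:app entry point in the comment) is a hypothetical module your request handlers would read the loaded model from, and the model path is a placeholder:

```python
# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py myapp.api:app
workers = 4
preload_app = False  # do not load TensorFlow in the master process

def post_worker_init(worker):
    # Runs inside each worker after it has been forked and initialised, so every
    # worker gets its own TensorFlow runtime instead of inheriting forked state.
    import myapp.model_store as model_store  # hypothetical module the API reads from
    model_store.load("/models/current")      # placeholder path (the customer's model)
    worker.log.info("model loaded in worker pid %s", worker.pid)
```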

Serving multiple deep learning models from cluster

I was thinking about how one should deploy multiple models for use. I am currently dealing with TensorFlow. I was referring to this and this article.
But I am not able to find any article that targets the need to serve several models in a distributed manner. Q.1. Does TensorFlow Serving only serve models from a single machine? Is there any way to set up a cluster of machines running TensorFlow Serving, so that multiple machines serve the same model (working somewhat as master and slave, or load balancing between them) while serving different models?
Q.2. Does similar functionality exist for other deep learning frameworks, say Keras, MXNet, etc. (not just restricting to TensorFlow, i.e. serving models from different frameworks)?
A1: Serving TensorFlow models in a distributed fashion is made easy with Kubernetes, a container orchestration system that takes away much of the pain of running a distributed system, including load balancing. Please check serving with Kubernetes.
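For illustration, a sketch of what that could look like with the official kubernetes Python client: a Deployment of several tensorflow/serving replicas that a Service (not shown) would then load-balance across. Names, namespace and the model path are placeholders.

```python
from kubernetes import client, config

config.load_kube_config()

container = client.V1Container(
    name="tf-serving",
    image="tensorflow/serving",
    args=["--model_name=recommender", "--model_base_path=/models/recommender"],
    ports=[client.V1ContainerPort(container_port=8501)],
)
template = client.V1PodTemplateSpec(
    metadata=client.V1ObjectMeta(labels={"app": "tf-serving"}),
    spec=client.V1PodSpec(containers=[container]),
)
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="tf-serving"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "tf-serving"}),
        template=template,
    ),
)
client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```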
A2: Sure, check for instance Prediction IO. It's not deep-learning specific, but it can be used to deploy models made with e.g. Spark MLlib.

Dask, Tensorflow serving (and Kubernetes and Streamz)

What is the current 'state of technology' for a pipeline composed of Python code and TensorFlow/Keras models?
We are trying to achieve scalability and a reactive design using dask and Streamz (with servers registered using Kubernetes).
But currently we do not know the right way to design such an infrastructure, given that we want our TensorFlow models to persist and not be repeatedly created and deleted.
Is Tensorflow serving the technology to be used for this task?
(I was only able to find basic examples, like Persistent dataflows with dask and http://matthewrocklin.com/blog/work/2017/02/11/dask-tensorflow)
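One way the pieces could fit, sketched under the assumption that the long-lived models sit inside TensorFlow Serving (so they persist between calls) while dask only orchestrates the Python parts of the pipeline and calls tf.Serving over REST; host, model name and the feature code are placeholders:

```python
import requests
from dask.distributed import Client

def preprocess(record):
    # Arbitrary Python feature engineering, running on dask workers.
    return [float(record["a"]), float(record["b"])]

def score(features):
    # The model itself lives in a persistent tf.Serving process, not in the worker.
    resp = requests.post(
        "http://tf-serving:8501/v1/models/pipeline_model:predict",
        json={"instances": [features]},
        timeout=5,
    )
    return resp.json()["predictions"][0]

dask_client = Client()  # connects to the (e.g. Kubernetes-managed) dask scheduler
records = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
futures = dask_client.map(score, dask_client.map(preprocess, records))
results = dask_client.gather(futures)
```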