Generative language models like GPT, seq2seq models, or transformer-based chatbots have decoder layers that take the model's own previous outputs as inputs.
The model generates words until it produces an [EOS] token, by repeatedly feeding the decoder's outputs back into the decoder as its next inputs.
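A minimal sketch of that loop, with a hypothetical model and tokenizer, just to make the question concrete:

def generate(model, tokenizer, prompt, max_len=128):
    # hypothetical helpers: encode/decode methods and an eos_id attribute
    ids = tokenizer.encode(prompt)
    for _ in range(max_len):
        next_id = model.predict(ids)      # decoder predicts the next token id
        if next_id == tokenizer.eos_id:   # stop once [EOS] is produced
            break
        ids.append(next_id)               # feed the output back in as input
    return tokenizer.decode(ids)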
However, there seems to be no mechanism for this loop in TensorFlow Serving.
If I want to serve a generative language model with TensorFlow Serving, should I send a gRPC prediction request for every generated word until the [EOS] token?
If so, how can I reduce the overhead of the gRPC API calls?
Are there any other recommendations for this kind of model?
I have an image classification deep learning CNN model (.h5 file) trained using Keras and Tensorflow 2 that I want to use online for predictions. I want an API that takes the single input image over HTTP and responds with the predicted class labels using the trained model. Is there an API provided by Keras or Tensorflow to do the same?
There are two basic options:
Use TensorFlow Serving - it provides a ready-to-go REST API server; the only thing you need to do is convert your model to the SavedModel (.pb) format (see the export sketch after this list).
Write your own simple REST server (using Flask, for example) that calls model.predict() on the inputs (that approach may be easier to start with, but it will be hard to scale/optimize for heavy load; see the Flask sketch after this list).
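A minimal sketch of option 1, assuming TensorFlow 2 and a hypothetical model.h5 file and export path:

import tensorflow as tf
from tensorflow.keras.models import load_model

model = load_model("model.h5")  # the trained Keras model (path is an assumption)
tf.saved_model.save(model, "export/my_model/1")  # versioned dir TF Serving can load

And a minimal sketch of option 2, a Flask server wrapping model.predict(); the input size, preprocessing and form field name are assumptions:

import io

import numpy as np
from flask import Flask, jsonify, request
from PIL import Image
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("model.h5")  # path is an assumption

@app.route("/predict", methods=["POST"])
def predict():
    # read the uploaded image from the "image" form field (field name is an assumption)
    img = Image.open(io.BytesIO(request.files["image"].read())).convert("RGB")
    img = img.resize((224, 224))  # assumed model input size
    batch = np.expand_dims(np.asarray(img) / 255.0, axis=0)
    probs = model.predict(batch)[0]
    return jsonify({"class": int(np.argmax(probs)), "score": float(np.max(probs))})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)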
I trained a model using mxnet framework. The inference time for the model is ~ 9 milliseconds.
The model mainly consists of conv layers and uses depthwise separable convolution.
I want to run that model in the browser. I converted the model to ONNX format, then from ONNX -> tensorflow -> tensorflowjs.
The inference time for the tensorflowjs model is ~129 milliseconds.
Any suggestion to improve the performance for the model?
I have also tried ONNXJS, but it seems it still has a few bugs.
Re-architecting would be a possibility since you're dealing with 129ms latency. You would have time to send images to an endpoint (EC2, or SageMaker + API Gateway) running a performant inference server.
Since Keras has become an API within TensorFlow, there is a lot of older Keras code, such as https://github.com/keiserlab/keras-neural-graph-fingerprint/blob/master/examples.py, that uses imports like
from keras import models
With the current version of TensorFlow, do we need to change every such import to the following?
from tensorflow.keras import models
You are mixing things up:
Keras (https://keras.io/) is a library independent from TensorFlow, which specifies a high-level API for building and training neural networks and is capable of using one of multiple backends (among which, TensorFlow) for low-level tensor computation.
tf.keras (https://www.tensorflow.org/guide/keras) implements the Keras API specification within TensorFlow. In addition, the tf.keras API is optimized to work well with other TensorFlow modules: you can pass a tf.data Dataset to the .fit() method of a tf.keras model, for instance, or convert a tf.keras model to a TensorFlow estimator with tf.keras.estimator.model_to_estimator. Currently, the tf.keras API is the high-level API to look for when building models within TensorFlow, and the integration with other TensorFlow features will continue in the future.
So to answer your question: no, you don't need to convert Keras code to tf.keras code. Keras code uses the Keras library, potentially even running on top of a different backend than TensorFlow, and will continue to work just fine in the future. More importantly, don't mix Keras and tf.keras objects within the same script, since this can produce incompatibilities, as you can see for example in this question.
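For instance, a minimal sketch of sticking consistently to tf.keras imports (the layer sizes are arbitrary):

from tensorflow.keras import layers, models

# Build everything from tf.keras -- don't mix in objects from standalone keras.
inputs = layers.Input(shape=(784,))
hidden = layers.Dense(64, activation="relu")(inputs)
outputs = layers.Dense(10, activation="softmax")(hidden)
model = models.Model(inputs=inputs, outputs=outputs)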
Update: Keras will be abandoned in favor of tf.keras: https://twitter.com/fchollet/status/1174019423541157888
I see that there are many similar functions between TensorFlow and Keras, like argmax, boolean_mask, etc. I wonder why people use Keras as a backend along with TensorFlow instead of using TensorFlow alone.
Keras is not a backend; it is a high-level API for building and training neural networks. Keras is capable of running on top of TensorFlow, Theano and CNTK. Most people prefer Keras due to its simplicity compared to lower-level libraries like TensorFlow. I recommend Keras for beginners in deep learning.
A Keras tensor is a tensor object from the underlying backend (Theano, TensorFlow or CNTK), which we augment with certain attributes that allow us to build a Keras model just by knowing the inputs and outputs of the model.
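A minimal sketch of what those attributes enable, using tf.keras and arbitrary shapes:

from tensorflow.keras import Input, Model
from tensorflow.keras.layers import Dense

x = Input(shape=(32,))   # a Keras tensor, not a bare backend tensor
y = Dense(4)(x)          # also a Keras tensor; it records which layer produced it
model = Model(inputs=x, outputs=y)  # Keras rebuilds the graph from x and y alone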
Theano vs Tensorflow
Tensorflow is necessary if you wish to use coremltools. Apple has promised support for architectures created using Theano, but I haven't seen it yet.
Keras requires slightly different syntactic sugar depending on the backend in use. I like the flexibility of TensorFlow's input layers and the easy access to strong Google neural networks.
I am trying to serve a TensorFlow Object Detection API model in TensorFlow Serving, and I am confused by the 3 different SignatureDefs. What are the differences, and when should I choose one over another?
TensorFlow Serving uses its own mechanism for loading model weights, and a dedicated signature mechanism is used in serving. To save a model for serving, it uses SavedModel. SavedModel provides a language-neutral format to save machine-learned models that is recoverable and hermetic. It enables higher-level systems and tools to produce, consume and transform TensorFlow models.
SavedModel supports SignatureDefs:
Graphs that are used for inference tasks typically have a set of inputs and outputs. This is called a Signature.
SavedModel uses SignatureDefs to allow generic support for signatures that may need to be saved with the graphs.
For those who previously used TF-Exporter/SessionBundle, Signatures in TF-Exporter will be replaced by SignatureDefs in SavedModel.
A SignatureDef requires specification of:
inputs as a map of string to TensorInfo.
outputs as a map of string to TensorInfo.
method_name (which corresponds to a supported method name in the loading tool/system).
Classification SignatureDefs support structured calls to TensorFlow Serving's Classification API. These prescribe that there must be an inputs Tensor, and that there are two optional output Tensors: classes and scores, at least one of which must be present.
Predict SignatureDefs support calls to TensorFlow Serving's Predict API. These signatures allow you to flexibly support arbitrarily many input and output Tensors. In the example below, the signature my_prediction_signature has a single logical input Tensor, images, that is mapped to the actual Tensor x:0 in your graph.
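A minimal sketch of building such a signature with the TF 1.x SavedModel utilities; the graph, tensor names and export path are assumptions:

import tensorflow as tf

# Toy graph: placeholder x:0 as input, y:0 as output (names are assumptions).
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
w = tf.Variable(tf.zeros([784, 10]))
y = tf.identity(tf.matmul(x, w), name="y")

builder = tf.saved_model.builder.SavedModelBuilder("/tmp/my_model/1")
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={"images": x},   # logical name "images" maps to graph tensor x:0
    outputs={"scores": y},  # logical name "scores" maps to graph tensor y:0
)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={"my_prediction_signature": signature},
    )
builder.save()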
Regression SignatureDefs support structured calls to TensorFlow Serving's Regression API. These prescribe that there must be exactly one inputs Tensor, and one outputs Tensor.
Please refer to:
https://www.tensorflow.org/serving/signature_defs
https://github.com/tensorflow/serving/issues/599
The Classify API is higher-level and more specific than the Predict API. Classify accepts tensorflow.serving.Input (which wraps a list of tf.Examples) as input and produces classes and scores as output. It is used for classification problems. Predict, on the other hand, accepts tensors as input and outputs tensors. It can be used for regression, classification and other types of inference problems.
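For illustration, a minimal sketch of calling the Predict API over gRPC; the model name, signature name, port and input shape are assumptions:

import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")  # default TF Serving gRPC port
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

batch = np.zeros((1, 784), dtype=np.float32)  # placeholder input batch

request = predict_pb2.PredictRequest()
request.model_spec.name = "my_model"                           # assumed model name
request.model_spec.signature_name = "my_prediction_signature"  # assumed signature
request.inputs["images"].CopyFrom(tf.make_tensor_proto(batch))

response = stub.Predict(request, timeout=10.0)
scores = tf.make_ndarray(response.outputs["scores"])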