Incompatible shapes: [11,768] vs. [1,5,768] - Inference in production with a huggingface saved model - tensorflow-serving

I have saved a pre-trained version of DistilBERT, distilbert-base-uncased-finetuned-sst-2-english, from the Hugging Face models, and I am attempting to serve it via TensorFlow Serving and make predictions. Everything is currently being tested in Colab.
I am having trouble getting the prediction into the correct format for the model via TensorFlow Serving. The server is up and running and serving the model fine, but my prediction code is not correct, and I need some help understanding how to make a prediction via JSON over the API.
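For context, the snippet below assumes roughly this setup (a hedged reconstruction; the original post does not show its imports, and the tokenizer checkpoint is taken from the description above):
import json
import requests
from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')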
# tokenize and encode a simple positive instance
instances = tokenizer.tokenize('this is the best day of my life!')
instances = tokenizer.encode(instances)
data = json.dumps({"signature_name": "serving_default", "instances": instances, })
print(data)
{"signature_name": "serving_default", "instances": [101, 2023, 2003, 1996, 2190, 2154, 1997, 2026, 2166, 999, 102]}
# setup json_response object
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=data, headers=headers)
predictions = json.loads(json_response.text)
predictions
{'error': '{{function_node __inference__wrapped_model_52602}} {{function_node __inference__wrapped_model_52602}} Incompatible shapes: [11,768] vs. [1,5,768]\n\t [[{{node tf_distil_bert_for_sequence_classification_3/distilbert/embeddings/add}}]]\n\t [[StatefulPartitionedCall/StatefulPartitionedCall]]'}
Any direction here would be appreciated.

I was able to find the solution by setting signatures for the input shape and attention mask, shown below. This is a simple implementation that uses a fixed input shape for the saved model and requires you to pad the inputs to the expected input length of 384. I have seen implementations that call custom signatures and build the model to match the expected input shapes, but the simple case below worked for what I wanted to accomplish when serving a Hugging Face model via TF Serve. If anyone has better examples or ways to extend this functionality, please post them for future use.
# create callable
from transformers import TFDistilBertForQuestionAnswering
distilbert = TFDistilBertForQuestionAnswering.from_pretrained('distilbert-base-cased-distilled-squad')
callable = tf.function(distilbert.call)
By calling get_concrete_function, we trace-compile the TensorFlow operations of the model for an input signature composed of two Tensors of shape [None, 384], the first one being the input ids and the second one the attention mask.
concrete_function = callable.get_concrete_function([tf.TensorSpec([None, 384], tf.int32, name="input_ids"), tf.TensorSpec([None, 384], tf.int32, name="attention_mask")])
save the model with the signatures:
# stored model path for TF Serve (1 = version 1) --> '/path/to/my/model/distilbert_qa/1/'
distilbert_qa_save_path = 'path_to_model'
tf.saved_model.save(distilbert, distilbert_qa_save_path, signatures=concrete_function)
check to see that it contains the correct signature:
saved_model_cli show --dir 'path_to_model' --tag_set serve --signature_def serving_default
output should look like:
The given SavedModel SignatureDef contains the following input(s):
  inputs['attention_mask'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_attention_mask:0
  inputs['input_ids'] tensor_info:
      dtype: DT_INT32
      shape: (-1, 384)
      name: serving_default_input_ids:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_0'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:0
  outputs['output_1'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 384)
      name: StatefulPartitionedCall:1
Method name is: tensorflow/serving/predict
TEST MODEL:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
question, text = "Who was Benjamin?", "Benjamin was a silly dog."
input_dict = tokenizer(question, text, return_tensors='tf')
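# note: older versions of transformers return a (start_logits, end_logits) tuple here;
# newer versions return an output object, so you may need outputs.start_logits / outputs.end_logits instead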
start_scores, end_scores = distilbert(input_dict)
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
answer = ' '.join(all_tokens[tf.math.argmax(start_scores, 1)[0] : tf.math.argmax(end_scores, 1)[0]+1])
FOR TF SERVE (in Colab), which was my original intent with this:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | apt-key add -
!apt update
!apt-get install tensorflow-model-server
import os
# path_to_model --> versions directory --> '/path/to/my/model/distilbert_qa/'
# actual stored model path version 1 --> '/path/to/my/model/distilbert_qa/1/'
MODEL_DIR = 'path_to_model'
os.environ["MODEL_DIR"] = os.path.abspath(MODEL_DIR)
%%bash --bg
nohup tensorflow_model_server --rest_api_port=8501 --model_name=my_model --model_base_path="${MODEL_DIR}" >server.log 2>&1
!tail server.log
MAKE A POST REQUEST:
import json
!pip install -q requests
import requests
import numpy as np
max_length = 384 # must equal model signature expected input value
question, text = "Who was Benjamin?", "Benjamin was a good boy."
# padding='max_length' pads the input to the expected input length (else incompatible shapes error)
input_dict = tokenizer(question, text, return_tensors='tf', padding='max_length', max_length=max_length)
input_ids = input_dict["input_ids"].numpy().tolist()[0]
att_mask = input_dict["attention_mask"].numpy().tolist()[0]
features = [{'input_ids': input_ids, 'attention_mask': att_mask}]
data = json.dumps({ "signature_name": "serving_default", "instances": features})
headers = {"content-type": "application/json"}
json_response = requests.post('http://localhost:8501/v1/models/my_model:predict', data=data, headers=headers)
print(json_response)
predictions = json.loads(json_response.text)['predictions']
all_tokens = tokenizer.convert_ids_to_tokens(input_dict["input_ids"].numpy()[0])
answer = ' '.join(all_tokens[tf.math.argmax(predictions[0]['output_0']) : tf.math.argmax(predictions[0]['output_1'])+1])
print(answer)
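As a possible extension (a hedged sketch, not part of the original answer): instead of tracing a fixed [None, 384] signature and padding every request to 384 tokens, the signature can be built from a tf.function that accepts variable-length input_ids and attention_mask, so clients only need to pad to the longest sequence in the batch. The serving_fn name and the output key names below are my own assumptions.
@tf.function(input_signature=[{
    "input_ids": tf.TensorSpec([None, None], tf.int32, name="input_ids"),
    "attention_mask": tf.TensorSpec([None, None], tf.int32, name="attention_mask"),
}])
def serving_fn(inputs):
    outputs = distilbert(inputs)
    # integer indexing works for both tuple-style and ModelOutput-style returns
    return {"start_logits": outputs[0], "end_logits": outputs[1]}

tf.saved_model.save(distilbert, distilbert_qa_save_path,
                    signatures={"serving_default": serving_fn})
With this signature the padded request above should still work, and shorter, unpadded batches should also be accepted.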

Related

RoBERTa example from tfhub produces error "During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string"

I would like to use the roberta-base model from tfhub. I am trying to run the example below, but I get an error when I feed sentences to the model as input: Invalid argument: During Variant Host->Device Copy: non-DMA-copy attempted of tensor type: string. I am using Python 3.7, TensorFlow 2.5, and tensorflow_hub 0.12.
If I replace the preprocessor and encoder with the corresponding BERT versions (the two lines immediately below), the code works. However, I would like it to work for RoBERTa as well (full example follows).
preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4", trainable=True)
# define a text embedding model
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer("https://tfhub.dev/jeongukjae/roberta_en_cased_preprocess/1")
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer("https://tfhub.dev/jeongukjae/roberta_en_cased_L-12_H-768_A-12/1", trainable=True)
encoder_outputs = encoder(encoder_inputs)
pooled_output = encoder_outputs["pooled_output"] # [batch_size, 768].
sequence_output = encoder_outputs["sequence_output"] # [batch_size, seq_length, 768].
model = tf.keras.Model(text_input, pooled_output)
# You can embed your sentences as follows
sentences = tf.constant(["(your text here)"])
print(model(sentences))
Additionally, the code above with the RoBERTa preprocessor/encoder does seem to work if I use the CPU instead of the GPU (adding with tf.device('/cpu:0')), but this is not feasible because I need to fine-tune the model on a lot of data.
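For reference, a minimal sketch of that CPU fallback, reusing the model built in the snippet above:
# depending on where the string ops land, the device scope may need to wrap
# model construction as well, not just the forward pass
with tf.device('/cpu:0'):
    sentences = tf.constant(["(your text here)"])
    print(model(sentences))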

Deploy keras model use tensorflow serving got 501 Server Error: Not Implemented for url: http://localhost:8501/v1/models/genre:predict

I saved a Keras .h5 model to .pb using SavedModelBuilder. After deploying the model with the tensorflow/serving:1.14.0 Docker image, I get "requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: http://localhost:8501/v1/models/genre:predict" when I run the prediction step.
The model-building code is as follows:
from keras import backend as K
import tensorflow as tf
from keras.models import load_model

model = load_model('/home/li/model.h5')
model_signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'input': model.input}, outputs={'output': model.output})

# export_path = os.path.join(model_path, model_version)
export_path = "/home/li/genre/1"

builder = tf.saved_model.builder.SavedModelBuilder(export_path)
builder.add_meta_graph_and_variables(
    sess=K.get_session(),
    tags=[tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        'predict': model_signature,
        'serving_default': model_signature
    })
builder.save()
Then I got the .pb model.
When I run saved_model_cli show --dir /home/li/genre/1 --all, the saved .pb model information is as follows:
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1, 128, 1292)
        name: conv2d_1_input_2:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 19)
        name: dense_2_2/Softmax:0
  Method name is: tensorflow/serving/predict

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1, 128, 1292)
        name: conv2d_1_input_2:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['output'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 19)
        name: dense_2_2/Softmax:0
  Method name is: tensorflow/serving/predict
The command I use to deploy with the tensorflow/serving Docker image is:
docker run -p 8501:8501 --name tfserving_genre --mount type=bind,source=/home/li/genre,target=/models/genre -e MODEL_NAME=genre -t tensorflow/serving:1.14.0 &
When I open http://localhost:8501/v1/models/genre in a browser, I get the message:
{
  "model_version_status": [
    {
      "version": "1",
      "state": "AVAILABLE",
      "status": {
        "error_code": "OK",
        "error_message": ""
      }
    }
  ]
}
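The same status check can also be made from the command line (a curl equivalent of opening the URL above):
curl http://localhost:8501/v1/models/genre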
The client prediction code is as follows:
import requests
import numpy as np
import os
import sys

from audio_to_spectrum_v2 import split_song_to_frames


# Define a base client class for TensorFlow Serving
class TFServingClient:
    """
    This is a base class that implements a TensorFlow Serving client
    """
    TF_SERVING_URL_FORMAT = '{protocol}://{hostname}:{port}/v1/models/{endpoint}:predict'

    def __init__(self, hostname, port, endpoint, protocol="http"):
        self.protocol = protocol
        self.hostname = hostname
        self.port = port
        self.endpoint = endpoint

    def _query_service(self, req_json):
        """
        :param req_json: dict (as defined in https://cloud.google.com/ml-engine/docs/v1/predict-request)
        :return: dict
        """
        server_url = self.TF_SERVING_URL_FORMAT.format(protocol=self.protocol,
                                                       hostname=self.hostname,
                                                       port=self.port,
                                                       endpoint=self.endpoint)
        response = requests.post(server_url, json=req_json)
        response.raise_for_status()
        print(response.json())
        return np.array(response.json()['output'])


# Define a specific client for our genre model
class GenreClient(TFServingClient):
    # INPUT_NAME is the config value we used when saving the model (the only value in the `input_names` list)
    INPUT_NAME = "input"

    def load_song(self, song_path):
        """Load a song from path, slice it into pieces, and extract features, returned as an np.array"""
        song_pieces = split_song_to_frames(song_path, False, 30)
        return song_pieces

    def predict(self, song_path):
        song_pieces = self.load_song(song_path)
        # Create a request json dict
        req_json = {
            "instances": song_pieces.tolist()
        }
        print(req_json)
        return self._query_service(req_json)


def main():
    song_path = sys.argv[1]
    print("file name:{}".format(os.path.split(song_path)[-1]))
    hostname = "localhost"
    port = "8501"
    endpoint = "genre"
    client = GenreClient(hostname=hostname, port=port, endpoint=endpoint)
    prediction = client.predict(song_path)
    print(prediction)


if __name__ == '__main__':
    main()
After running the prediction code, I get the following error:
Traceback (most recent call last):
  File "client_predict.py", line 90, in <module>
    main()
  File "client_predict.py", line 81, in main
    prediction = client.predict(song_path)
  File "client_predict.py", line 69, in predict
    return self._query_service(req_json)
  File "client_predict.py", line 40, in _query_service
    response.raise_for_status()
  File "/home/li/anaconda3/lib/python3.7/site-packages/requests/models.py", line 940, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 501 Server Error: Not Implemented for url: http://localhost:8501/v1/models/genre:predict
I wonder what the reason for this deployment problem is, and how to solve it. Thanks, all.
I printed the response using
pred = json.loads(r.content.decode('utf-8'))
print(pred)
The problem is caused by "conv implementation only supports NHWC tensor format for now."
In the end, I changed the data format from NCHW to NHWC in the Conv2D layers.
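A minimal sketch of that change (an assumption about the original architecture, reconstructed only from the saved signature's (-1, 1, 128, 1292) input and 19-way softmax output): build the conv layers with channels_last (NHWC) input instead of channels_first (NCHW), since the CPU conv kernels used by the serving image only support NHWC.
from keras.models import Sequential
from keras.layers import Conv2D, GlobalAveragePooling2D, Dense

model = Sequential([
    Conv2D(32, (3, 3), activation='relu',
           data_format='channels_last',   # NHWC; was 'channels_first' (NCHW)
           input_shape=(128, 1292, 1)),   # (H, W, C) instead of (1, 128, 1292)
    GlobalAveragePooling2D(),
    Dense(19, activation='softmax'),      # 19 genre classes, per the signature
])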

Deploying a TensorFlow model on Google Cloud that receives a base64 encoded string as a model input

I have successfully set up Google Cloud and deployed a pre-trained ML model that takes an input tensor (image) of shape=(?, 224, 224, 3) and dtype=float32. It works well, but this is inefficient when making REST requests and should really use a base64-encoded string. The challenge is that I am using transfer learning and cannot control the input of the original pre-trained model. To get around this without adding additional infrastructure, I created a small graph (wrapper) that handles the base64-to-array conversion and connected it to my pre-trained model graph, yielding a new single graph. The small graph takes an input tensor with shape=(), dtype=string and returns a tensor with shape=(224, 224, 3), dtype=float32, which can then be passed to the original model. The model compiles to a .pb file without errors and deploys successfully, but I get the following error when making my POST request:
{'error': 'Prediction failed: Error during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Index out of range using input dim 0; input has only 0 dims\n\t [[{{node lambda/map/while/strided_slice}}]]")'}
Post request body:
{'instances': [{'b64': 'iVBORw0KGgoAAAANSUhEUgAAAOAA...'}]}
This error leads me to believe the POST request is incorrectly formatted for handling the base64 string, or that my base64-conversion graph input is set up incorrectly. I can run the code locally by calling predict on my combined model, passing it a locally constructed tensor of shape=(), dtype=string, and successfully get a result.
Here is my code for combining the 2 graphs:
import tensorflow as tf

# Local dependencies
from myProject.classifier_models import mobilenet
from myProject.dataset_loader import dataset_loader
from myProject.utils import f1_m, recall_m, precision_m

with tf.keras.backend.get_session() as sess:

    def preprocess_and_decode(img_str, new_shape=[224, 224]):
        #img = tf.io.decode_base64(img_str)
        img = tf.image.decode_png(img_str, channels=3)
        img = (tf.cast(img, tf.float32) / 127.5) - 1
        img = tf.image.resize_images(img, new_shape, method=tf.image.ResizeMethod.AREA, align_corners=False)
        # If you need to squeeze your input range to [0,1] or [-1,1] do it here
        return img

    InputLayer = tf.keras.layers.Input(shape=(1,), dtype="string")
    OutputLayer = tf.keras.layers.Lambda(lambda img: tf.map_fn(lambda im: preprocess_and_decode(im[0]), img, dtype="float32"))(InputLayer)
    base64_model = tf.keras.Model(InputLayer, OutputLayer)

    tf.keras.backend.set_learning_phase(0)  # Ignore dropout at inference
    transfer_model = tf.keras.models.load_model('./trained_model/mobilenet_93.h5', custom_objects={'f1_m': f1_m, 'recall_m': recall_m, 'precision_m': precision_m})
    sess.run(tf.global_variables_initializer())

    base64_input = base64_model.input
    final_output = transfer_model(base64_model.output)
    new_model = tf.keras.Model(base64_input, final_output)

    export_path = '../myModels/001'
    tf.saved_model.simple_save(
        sess,
        export_path,
        inputs={'input_class': new_model.input},
        outputs={'output_class': new_model.output})
Tech: TensorFlow 1.13.1 & Python 3.5
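Before deploying, it may also be worth confirming what the export actually contains (a hedged check, mirroring the saved_model_cli usage elsewhere on this page; the path is the export_path used above). Comparing the reported input name and shape against the {'instances': [{'b64': ...}]} request format can help localize the formatting mismatch.
saved_model_cli show --dir ../myModels/001 --tag_set serve --signature_def serving_default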
I have looked at a bunch of related posts such as:
https://stackoverflow.com/a/50606625
https://stackoverflow.com/a/42859733
http://www.voidcn.com/article/p-okpgbnul-bvs.html (right-click translate to english)
https://cloud.google.com/ml-engine/docs/tensorflow/online-predict
Any suggestions or feedback would be greatly appreciated!
Update 06/12/2019:
Inspecting the three graph summaries, everything appears to be correctly merged.
Update 06/14/2019:
Ended up going with this alternative strategy instead, implementing a tf.estimator
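A hedged sketch of that tf.estimator route (not the poster's final code; it reuses the preprocessing logic above, and the estimator conversion plus the 'input_class' and 'image_bytes' names are assumptions): export with a serving input receiver that accepts the {'b64': ...} strings and hands the pre-trained graph the (224, 224, 3) float tensor it expects.
# assumes: estimator = tf.keras.estimator.model_to_estimator(keras_model=transfer_model)
def serving_input_receiver_fn():
    input_ph = tf.placeholder(tf.string, shape=[None], name='image_bytes')

    def decode_and_preprocess(img_str):
        img = tf.image.decode_png(img_str, channels=3)
        img = (tf.cast(img, tf.float32) / 127.5) - 1
        return tf.image.resize_images(img, [224, 224],
                                      method=tf.image.ResizeMethod.AREA)

    images = tf.map_fn(decode_and_preprocess, input_ph, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        {'input_class': images},     # key should match the model's input name (assumption here)
        {'image_bytes': input_ph})

estimator.export_savedmodel('../myModels', serving_input_receiver_fn=serving_input_receiver_fn)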

ML engine serving seems to not be working as intended

While using the following code and doing a gcloud ml-engine local predict I get:
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder' with dtype string and shape [?]
   [[Node: Placeholder = Placeholder[dtype=DT_STRING, shape=[?], _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] (Error code: 2)
tf_files_path = './tf'
# os.makedirs(tf_files_path)  # temp dir
estimator = tf.keras.estimator.model_to_estimator(keras_model_path="model_data/yolo.h5",
                                                  model_dir=tf_files_path)
# up_one_dir(os.path.join(tf_files_path, 'keras'))

def serving_input_receiver_fn():
    def prepare_image(image_str_tensor):
        image = tf.image.decode_jpeg(image_str_tensor,
                                     channels=3)
        image = tf.divide(image, 255)
        image = tf.image.convert_image_dtype(image, tf.float32)
        return image

    # Ensure model is batchable
    # https://stackoverflow.com/questions/52303403/
    input_ph = tf.placeholder(tf.string, shape=[None])
    images_tensor = tf.map_fn(
        prepare_image, input_ph, back_prop=False, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        {model.input_names[0]: images_tensor},
        {'image_bytes': input_ph})

export_path = './export'
estimator.export_savedmodel(
    export_path,
    serving_input_receiver_fn=serving_input_receiver_fn)
The json I am sending to the ml engine looks like this:
{"image_bytes": {"b64": "/9j/4AAQSkZJRgABAQAAAQABAAD/2w..."}}
When not doing a local prediction, but sending it to ML engine itself, I get:
ERROR: (gcloud.ml-engine.predict) HTTP request failed. Response: {
  "error": {
    "code": 500,
    "message": "Internal error encountered.",
    "status": "INTERNAL"
  }
}
The saved_model_cli gives:
saved_model_cli show --all --dir export/1547848897/
MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['image_bytes'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: Placeholder:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['conv2d_59'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1, -1, 255)
        name: conv2d_59/BiasAdd:0
    outputs['conv2d_67'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1, -1, 255)
        name: conv2d_67/BiasAdd:0
    outputs['conv2d_75'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1, -1, 255)
        name: conv2d_75/BiasAdd:0
  Method name is: tensorflow/serving/predict
Does anyone see what is going wrong here?
The issue has been resolved. The output of the model turned out to be too big for ml-engine to send back, and this was not surfaced in a more relevant exception than the 500 internal error. We added some post-processing steps to the model and it works fine now.
As for the gcloud ml-engine local predict command returning an error, it seems to be a bug: the model now works on ml-engine, but local prediction still returns this error.

sagemaker invoke_endpoint signature_def feature prep

I have a sagemaker tensorflow model using a custom estimator, similar to the abalone.py sagemaker tensorflow example, using build_raw_serving_input_receiver_fn in the serving_input_fn:
def serving_input_fn(params):
    tensor = tf.placeholder(tf.float32, shape=[1, NUM_FEATURES])
    return build_raw_serving_input_receiver_fn({INPUT_TENSOR_NAME: tensor})()
Predictions are being requested from JavaScript using JSON:
response = @client.invoke_endpoint(
  endpoint_name: @name,
  content_type: "application/json",
  accept: "application/json",
  body: values.to_json
)
Everything is fine so far. Now I want to add some feature engineering (scaling transformations on the features, using a scaler derived from the training data). Following the pattern of the answer for Data Normalization with tensorflow tf-transform,
I've now got a serving_input_fn like this:
def serving_input_fn(params):
    feature_placeholders = {
        'f1': tf.placeholder(tf.float32, [None]),
        'f2': tf.placeholder(tf.float32, [None]),
        'f3': tf.placeholder(tf.float32, [None]),
    }
    features = {
        key: tf.expand_dims(tensor, -1)
        for key, tensor in feature_placeholders.items()
    }
    return tf.estimator.export.ServingInputReceiver(add_engineering(features), feature_placeholders)
From saved_model_cli show --dir . --all I can see the input signature has changed:
signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['f1'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_1:0
    inputs['f2'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder_2:0
    inputs['f3'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1)
        name: Placeholder:0
How do I prepare features for prediction from this new model? In python I've been unsuccessfully trying things like
requests = [{'f1':[0.1], 'f2':[0.1], 'f3':[0.2]}]
predictor.predict(requests)
I also need to send prediction requests from JavaScript.
You can define an input_fn:
def input_fn(data=None, content_type=None):
This is called directly when a request is made to SageMaker, and you can do your feature preparation in this function; model_fn is called after it.
Make sure you return a dict mapping the input tensor name to a TensorProto, i.e. dict{"input tensor name", TensorProto}, from the input_fn method.
You can find more details here:
https://docs.aws.amazon.com/sagemaker/latest/dg/tf-training-inference-code-template.html
A sample input_fn would look something like this:
def input_fn(data=None, content_type=None):
    """
    Args:
        data: An Amazon SageMaker InvokeEndpoint request body
        content_type: An Amazon SageMaker InvokeEndpoint ContentType value for data.
    Returns:
        object: A deserialized object that will be used by TensorFlow serving as input.
    """
    # `inputs` is based on the parameters defined in the model spec's signature_def
    return {"inputs": tf.make_tensor_proto(data, shape=(1,))}
I have managed to make the feature values available on their way into prediction via a SageMaker input_fn definition, as suggested by Raman. This means going back to the build_raw_serving_input_receiver_fn version of serving_input_fn that I started with (top of post). The input_fn looks like this:
def input_fn(data=None, content_type=None):
    if content_type == 'application/json':
        values = np.asarray(json.loads(data))
        return {"inputs": tf.make_tensor_proto(values=values, shape=values.shape, dtype=tf.float32)}
    else:
        return {"inputs": data}
Although I can't pass e.g. a scaler from training into this procedure, it will probably work to embed it in the model.py file that SageMaker requires (which contains this input_fn definition). What I have now responds correctly when addressed from Python, either by
data = [[0.1, 0.2, 0.3]]
payload = json.dumps(data)
response = client.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=payload,
    ContentType='application/json'
)
result = json.loads(response['Body'].read().decode())
or
values = np.asarray([[0.1, 0.2, 0.3]])
prediction = predictor.predict(values)
This is all new to me... please recommend improvements/alert me to potential problems if you know of any.
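One possible refinement along the lines suggested above (embedding the training-set scaler in the model.py that contains input_fn) might look like this; a hedged sketch in which the mean/scale constants are placeholders, not values from the actual training data:
import json
import numpy as np
import tensorflow as tf

# placeholder statistics; in practice these would be baked in from the training set
FEATURE_MEANS = np.array([0.0, 0.0, 0.0], dtype=np.float32)
FEATURE_SCALES = np.array([1.0, 1.0, 1.0], dtype=np.float32)

def input_fn(data=None, content_type=None):
    if content_type == 'application/json':
        values = np.asarray(json.loads(data), dtype=np.float32)
        scaled = (values - FEATURE_MEANS) / FEATURE_SCALES
        return {"inputs": tf.make_tensor_proto(values=scaled, shape=scaled.shape, dtype=tf.float32)}
    else:
        return {"inputs": data}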