Deploy a TensorFlow model built using TF2 in TF1 format (no SavedModel bundles found!) - tensorflow

I have used the Recommenders library https://github.com/microsoft/recommenders to train an NCF recommendation model. I'm currently running into issues deploying it through the SageMaker TensorFlowModel class.
The model is saved using the following code:
def save(self, dir_name):
    """Save model parameters in `dir_name`

    Args:
        dir_name (str): directory name, which should be a folder name instead of a file name;
            we will create a new directory if it does not exist.
    """
    # save trained model
    if not os.path.exists(dir_name):
        os.makedirs(dir_name)
    saver = tf.compat.v1.train.Saver()
    saver.save(self.sess, os.path.join(dir_name, MODEL_CHECKPOINT))
Files exported in the process are 'checkpoint', 'model.ckpt.data-00000-of-00001', 'model.ckpt.index', 'model.ckpt.meta'
They follow the structure of:
- model.tar.gz
  - 00000000
    - checkpoint
    - model.ckpt.data-00000-of-00001
    - model.ckpt.index
    - model.ckpt.meta
I have tried various deployment approaches, but they all give the same error. Here's the latest one, which I implemented following this example: https://github.com/aws/amazon-sagemaker-examples/blob/main/sagemaker-script-mode/pytorch_bert/code/inference_code.py
from sagemaker.tensorflow.model import TensorFlowModel

model = TensorFlowModel(
    entry_point="tf_inference.py",
    model_data=zipped_model_path,
    role=role,
    model_version='1',
    framework_version="2.7"
)

predictor = model.deploy(
    initial_instance_count=1, instance_type="ml.g4dn.2xlarge", endpoint_name='endpoint-name3'
)
All solutions end with the same error, over and over again:
Traceback (most recent call last):
File "/sagemaker/serve.py", line 502, in <module>
ServiceManager().start()
File "/sagemaker/serve.py", line 482, in start
self._create_tfs_config()
File "/sagemaker/serve.py", line 153, in _create_tfs_config
raise ValueError("no SavedModel bundles found!")

These two links helped me resolve the issue:
https://github.com/aws/sagemaker-python-sdk/issues/599
https://www.tensorflow.org/guide/migrate/saved_model#1_save_the_graph_as_a_savedmodel_with_savedmodelbuilder
SageMaker expects a specific directory structure that you need to follow strictly. The first link shows the expected starting directories, and the second one shows how to save the model as a SavedModel from both TF1 and TF2.
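For reference, here is a rough sketch of re-exporting the trained session as a SavedModel with the TF1-style SavedModelBuilder from the second link. This is not the Recommenders library's own export code: the tensor names, the session handle and the export layout are assumptions. SageMaker's TensorFlow Serving container expects model.tar.gz to contain a numbered SavedModel directory (e.g. 1/saved_model.pb plus its variables/ folder) rather than bare checkpoint files.

import os
import tensorflow as tf

def export_saved_model(sess, export_base, user_input, item_input, output):
    # write the SavedModel into a numbered version directory, e.g. <export_base>/1/
    export_dir = os.path.join(export_base, "1")
    builder = tf.compat.v1.saved_model.Builder(export_dir)
    # hypothetical tensor names; replace with the model's real input/output tensors
    signature = tf.compat.v1.saved_model.predict_signature_def(
        inputs={"user_input": user_input, "item_input": item_input},
        outputs={"output": output},
    )
    builder.add_meta_graph_and_variables(
        sess,
        tags=[tf.compat.v1.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.compat.v1.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature
        },
    )
    builder.save()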

Related

How can I use a Cloud TPU with Tensorflow Lite Model Maker?

I'm training an object detection model (EfficientDet-Lite) using Tensorflow Lite Model Maker in Colab and I'd like to use a Cloud TPU. I have all the images in a GCS bucket and provide a CSV file. When I call object_detector.create I get the following error:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in shape(self)
1196 # `_tensor_shape` is declared and defined in the definition of
1197 # `EagerTensor`, in C.
-> 1198 self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
1199 except core._NotOkStatusException as e:
1200 six.raise_from(core._status_to_exception(e.code, e.message), None)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables')
That looks like it's trying to read some local files from the Cloud TPU, which doesn't work...
The gist of what I'm doing is:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()

train_data, validation_data, test_data = object_detector.DataLoader.from_csv(
    drive_dir + csv_name,
    images_dir="images" if not tpu else None,
    cache_dir=drive_dir + "cub_cache",
)

spec = MODEL_SPEC(tflite_max_detections=10, strategy='tpu', tpu=tpu.master(), gcp_project="xxx")

model = object_detector.create(train_data=train_data,
                               model_spec=spec,
                               validation_data=validation_data,
                               epochs=epochs,
                               batch_size=batch_size,
                               train_whole_model=True)
I can't find any example with Model Maker that uses Cloud TPU.
Edit: the error seems to occur when the EfficientDet model gets loaded, so somehow Model Maker must be pointing to a local file that doesn't work for the Cloud TPU?
Yes, the error is happening in TF Hub, which seems to be a well-known issue: TF Hub tries to load the model from a local cache, which the Cloud TPU cannot access (and which the Colab runtime doesn't even provide). Check out https://github.com/tensorflow/hub/issues/604, which should get you past this error.
- Download the model you would like to train from TF Hub (replace X with 0 <= X <= 4): https://tfhub.dev/tensorflow/efficientdet/liteX/feature-vector/1
- Extract the package (twice) until you get to keras_metadata.pb, saved_model.pb and the variables folder.
- Upload these files and folders to a Google Cloud Storage bucket.
- Pass the uri argument to model_spec.get (https://www.tensorflow.org/lite/tutorials/model_maker_object_detection), pointing to the Cloud bucket folder (in gs:// format), as sketched below.
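A minimal sketch of that last step, assuming the EfficientDet-Lite0 spec class exposes the uri parameter described in the linked docs; the bucket path and project name are placeholders:

from tflite_model_maker import object_detector

spec = object_detector.EfficientDetLite0Spec(
    uri="gs://your-bucket/efficientdet_lite0_feature_vector",  # the extracted TF Hub SavedModel
    tflite_max_detections=10,
    strategy="tpu",
    tpu=tpu.master(),
    gcp_project="your-gcp-project",
)

model = object_detector.create(train_data=train_data,
                               model_spec=spec,
                               validation_data=validation_data,
                               train_whole_model=True)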

Trouble running converted custom trained model on RPI4 using TFLite

I am trying to run a mask detecting model on my RPI4.
I followed this Roboflow tutorial:
https://www.youtube.com/watch?v=pXLLNa4IrmM&list=LL&index=1&t=1169s.
However, after converting the darknet model to a TFLite model, I am getting this error:
2021-01-30 21:42:03.351149: E tensorflow/core/platform/hadoop/hadoop_file_system.cc:132] HadoopFileSystem load error: libhdfs.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
  File "TFLite_detection_webcam.py", line 138, in <module>
    interpreter = Interpreter(model_path=PATH_TO_CKPT)
  File "/home/pi/.local/lib/python3.7/site-packages/tensorflow_core/lite/python/interpreter.py", line 207, in __init__
    model_path, self._custom_op_registerers))
ValueError: Didn't find op for builtin opcode 'RESIZE_BILINEAR' version '3' Registration failed.
I know I am probably missing information that could help solve this problem, but I am not sure what else I should be including.
I was able to run pre-made object detection models; however, the custom model I made using the YouTube guide did not work.
How do I solve this issue? Or, is there another way to train a custom model to work on the RPI4?
I think you are using too old a version of TensorFlow. Can you try the latest nightly build?
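As a quick sanity check (a sketch, assuming the full tensorflow package is installed; the model path is a placeholder), you can print the runtime version and try loading the model again after upgrading, since the RESIZE_BILINEAR version-3 op is only registered in newer TFLite runtimes:

import tensorflow as tf

print(tf.__version__)

# fails with the opcode error on runtimes that are too old
interpreter = tf.lite.Interpreter(model_path="detect.tflite")
interpreter.allocate_tensors()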

Tf 2.0 MirroredStrategy on Albert TF Hub model (multi gpu)

I'm trying to run Albert Tensorflow hub version on multiple GPUs in the same machine. The model works perfectly on single GPU.
This is the structure of my code:
strategy = tf.distribute.MirroredStrategy()
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))  # it prints 2 .. correct

if __name__ == "__main__":
    with strategy.scope():
        run()
Where in run() function, I read the data, build the model, and fit it.
I'm getting this error:
Traceback (most recent call last):
File "Albert.py", line 130, in <module>
run()
File "Albert.py", line 88, in run
model = build_model(bert_max_seq_length)
File "Albert.py", line 55, in build_model
model.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])
File "/home/****/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
result = method(self, *args, **kwargs)
File "/home/bighanem/py_transformers/lib/python3.5/site-packages/tensorflow_core/python/keras/engine/training.py", line 471, in compile
' model.compile(...)'% (v, strategy))
ValueError: Variable (<tf.Variable 'bert/embeddings/word_embeddings:0' shape=(30000, 128) dtype=float32>) was not created in the distribution strategy scope of (<tensorflow.python.distribute.mirrored_strategy.MirroredStrategy object at 0x7f62e399df60>). It is most likely due to not all layers or the model or optimizer being created outside the distribution strategy scope. Try to make sure your code looks similar to the following.
with strategy.scope():
model=_create_model()
model.compile(...)
Is it possible that this error occurs because the ALBERT model was prepared beforehand by the TensorFlow team (built and compiled)?
Edited:
To be precise, the TensorFlow version is 2.1.
Also, this is how I load the pretrained ALBERT model:
features = {"input_ids": in_id, "input_mask": in_mask, "segment_ids": in_segment}

albert = hub.KerasLayer(
    "https://tfhub.dev/google/albert_xxlarge/3",
    trainable=False, signature="tokens", output_key="pooled_output",
)

x = albert(features)
Following this tutorial: SavedModels from TF Hub in TensorFlow 2
Two-part answer:
1) TF Hub hosts two versions of ALBERT (each in several sizes):
https://tfhub.dev/google/albert_base/3 etc. from the Google research team that originally developed ALBERT comes in the hub.Module format for TF1. This will likely not work with a TF2 distribution strategy.
https://tfhub.dev/tensorflow/albert_en_base/1 etc. from the TensorFlow Model Garden comes in the revised TF2 SavedModel format. Please try this one for use in TF2 with a distribution strategy.
2) That said, the immediate problem appears to be what is explained in the error message (abridged):
Variable 'bert/embeddings/word_embeddings' was not created in the distribution strategy scope ... Try to make sure your code looks similar to the following.
with strategy.scope():
    model = _create_model()
    model.compile(...)
For a SavedModel (from TF Hub or otherwise), it's the loading that needs to happen under the distribution strategy scope, because that's what's re-creating the tf.Variable objects in the current program. Specifically, any of the following ways to load a TF2 SavedModel from TF Hub have to occur under the distribution strategy scope for distribution to work:
- tf.saved_model.load();
- hub.load(), which just calls tf.saved_model.load() (after downloading if necessary);
- hub.KerasLayer when used with a string-valued model handle, on which it then calls hub.load().
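Putting that together, a minimal sketch (the TF2 SavedModel handle comes from point 1 above, and build_model is a hypothetical helper that wires the layer into a Keras model):

import tensorflow as tf
import tensorflow_hub as hub

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # loading under the scope re-creates the variables inside the strategy
    albert = hub.KerasLayer(
        "https://tfhub.dev/tensorflow/albert_en_base/1",
        trainable=False,
    )
    model = build_model(albert)  # hypothetical: build the full Keras model here, still inside the scope
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])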

Save Keras ModelCheckpoints in Google Cloud Bucket

I'm working on training an LSTM network on Google Cloud Machine Learning Engine using Keras with the TensorFlow backend. I managed to deploy my model and perform a successful training task after some adjustments to gcloud and my Python script.
I then tried to make my model save checkpoints after every epoch using the Keras ModelCheckpoint callback. Running a local training job with Google Cloud works perfectly as expected: the weights are stored in the specified path after each epoch. But when I try to run the same job online on Google Cloud Machine Learning Engine, the weights.hdf5 does not get written to my Google Cloud bucket. Instead I get the following error:
...
File "h5f.pyx", line 71, in h5py.h5f.open (h5py/h5f.c:1797)
IOError: Unable to open file (Unable to open file: name = 'gs://.../weights.hdf5', errno = 2, error message = 'no such file or directory', flags = 0, o_flags = 0)
I investigated this issue and it turned out that there is no problem with the bucket itself, as the Keras TensorBoard callback works fine and writes the expected output to the same bucket. I also made sure that h5py gets included by listing it in the setup.py located at:
├── setup.py
└── trainer
├── __init__.py
├── ...
The actual include in setup.py is shown below:
# setup.py
from setuptools import setup, find_packages

setup(name='kerasLSTM',
      version='0.1',
      packages=find_packages(),
      author='Kevin Katzke',
      install_requires=['keras', 'h5py', 'simplejson'],
      zip_safe=False)
I guess the problem comes down to the fact that GCS cannot be accessed with Python's open for I/O, since TensorFlow instead provides a custom implementation:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

with file_io.FileIO("gs://...", 'w') as f:
    f.write("Hi!")
After checking how the Keras ModelCheckpoint callback implements the actual file writing, it turned out that it uses h5py.File() for I/O:
with h5py.File(filepath, mode='w') as f:
    f.attrs['keras_version'] = str(keras_version).encode('utf8')
    f.attrs['backend'] = K.backend().encode('utf8')
    f.attrs['model_config'] = json.dumps({
        'class_name': model.__class__.__name__,
        'config': model.get_config()
    }, default=get_json_type).encode('utf8')
And as the h5py package is a Pythonic interface to the HDF5 binary data format, h5py.File() calls down into the underlying HDF5 C library as far as I can tell: source, documentation.
How can I fix this and make the ModelCheckpoint callback write to my GCS bucket? Is there a way to "monkey patch" how an HDF5 file is opened so that it uses GCS's file_io.FileIO()?
I might be a bit late on this, but for the sake of future visitors I would describe the whole process of how to adapt the code that was previously run locally to be GoogleML-aware from the IO point of view.
Python standard open(file_name, mode) does not work with buckets (gs://...../file_name). One needs to from tensorflow.python.lib.io import file_io and change all calls to open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter). The interface of the opened handle is the same.
Keras and other libraries mostly use the standard open(file_name, mode) internally, so calls into third-party code such as trained_model.save(file_path) will fail to store the result to the bucket. The only way to retrieve a model after the job has finished successfully is to store it locally and then move it to the bucket.
The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:
model.save(file_path)
with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        output_f.write(input_f.read())
The mode must be set to binary for both reading and writing.
When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption.
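For example, a chunked copy could look like this (a sketch using the same file_io API as above; the 4 MiB chunk size is arbitrary):

from tensorflow.python.lib.io import file_io

def copy_file_in_chunks(src, dst, chunk_size=4 * 1024 * 1024):
    # stream the file in fixed-size chunks instead of reading it all into memory
    with file_io.FileIO(src, mode='rb') as input_f:
        with file_io.FileIO(dst, mode='wb+') as output_f:
            while True:
                chunk = input_f.read(chunk_size)
                if not chunk:
                    break
                output_f.write(chunk)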
Before running a real task, I would advise running a stub that simply saves a file to the remote bucket.
This implementation, temporarily put in place of the real train_model call, should do:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        help='GCS location with read/write access',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__
    job_dir = arguments.pop('job_dir')

    with file_io.FileIO(os.path.join(job_dir, "test.txt"), mode='wb+') as of:
        of.write("Test passed.")
After a successful execution you should see the file test.txt with a content "Test passed." in your bucket.
The issue can be solved with the following piece of code:
# Save Keras ModelCheckpoints locally
model.save('model.h5')

# Copy model.h5 over to Google Cloud Storage
with file_io.FileIO('model.h5', mode='r') as input_f:
    with file_io.FileIO('gs://your-bucket/model.h5', mode='w+') as output_f:  # bucket path is illustrative
        output_f.write(input_f.read())
print("Saved model.h5 to GCS")
The model.h5 is saved on the local filesystem and then copied over to GCS. As Jochen pointed out, there is currently no easy way to write HDF5 model checkpoints to GCS. With this hack it is possible to write the data until an easier solution is provided.
I faced a similar problem and the solution above didn't work for me. The file must be read and written in binary form; otherwise this error will be thrown:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
So the code becomes:
def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())
Here is the code I wrote to save the model after each epoch:
import os
import numpy as np
import warnings
from keras.callbacks import ModelCheckpoint
from tensorflow.python.lib.io import file_io  # needed for the GCS copy below


class ModelCheckpointGC(ModelCheckpoint):
    """Taken from and modified:
    https://github.com/keras-team/keras/blob/tf-keras/keras/callbacks.py
    """

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(filepath, overwrite=True)
                        else:
                            if is_development():
                                self.model.save(filepath, overwrite=True)
                            else:
                                # save locally, then copy the file into the bucket
                                self.model.save(filepath.split("/")[-1])
                                with file_io.FileIO(filepath.split("/")[-1], mode='rb') as input_f:
                                    with file_io.FileIO(filepath, mode='wb+') as output_f:
                                        output_f.write(input_f.read())
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch, filepath))
                if self.save_weights_only:
                    self.model.save_weights(filepath, overwrite=True)
                else:
                    if is_development():
                        self.model.save(filepath, overwrite=True)
                    else:
                        # save locally, then copy the file into the bucket
                        self.model.save(filepath.split("/")[-1])
                        with file_io.FileIO(filepath.split("/")[-1], mode='rb') as input_f:
                            with file_io.FileIO(filepath, mode='wb+') as output_f:
                                output_f.write(input_f.read())
There is a function is_development() that checks whether the code is running locally or on gcloud. In the local environment I set the variable LOCAL_ENV=1:
def is_development():
    """Check if the environment is local or in the gcloud.

    Created the local variable in the bash profile:
    export LOCAL_ENV=1

    Returns:
        [boolean] -- True if local env
    """
    try:
        if os.environ['LOCAL_ENV'] == '1':
            return True
        else:
            return False
    except:
        return False
Then you can use it:
ModelCheckpointGC(
    'gs://your_bucket/models/model.h5',
    monitor='loss',
    verbose=1,
    save_best_only=True,
    mode='min')
I hope that helps someone and saves some time.
A hacky workaround is to save to the local filesystem, then copy using the TF IO API. I added an example to the Keras sample in the GoogleCloudPlatform ML samples.
Basically it checks whether the target directory is a GCS path ("gs://"), forces h5py to write to the local filesystem, and then copies the file to GCS using the TF file_io API. See for example: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/task.py#L146
For me the easiest way is to use gsutil.
model.save('model.h5')
!gsutil -m cp model.h5 gs://name-of-cloud-storage/model.h5
I am not sure why this is not mentioned already, but there is a solution where you don't need to add a copy function in your code.
Install gcsfuse following these steps:
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
Then mount your bucket locally:
mkdir bucket
gcsfuse <cloud_bucket_name> bucket
and then use the local directory bucket/ as the logdir of your model.
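For instance, a Keras ModelCheckpoint can then write straight to the mounted path (a sketch; paths and training data names are illustrative):

from keras.callbacks import ModelCheckpoint

# 'bucket/' is the gcsfuse mount point created above; the sub-path is illustrative
checkpoint = ModelCheckpoint("bucket/checkpoints/weights.{epoch:02d}.hdf5")
model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])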
Syncing of cloud and local directory will be automated for you and your code can stay clean.
Hope it helps :)
tf.keras.models.save_model(model, filepath, save_format="tf")
save_format: Either 'tf' or 'h5', indicating whether to save the model to Tensorflow SavedModel or HDF5. Defaults to 'tf' in TF 2.X, and 'h5' in TF 1.X.
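Because the SavedModel writer goes through TensorFlow's own file system layer (unlike the h5py-based HDF5 writer), it can write to a gs:// path directly. A minimal sketch with an illustrative bucket path:

import tensorflow as tf

# save_format="tf" writes a SavedModel, which TensorFlow can stream straight to GCS
tf.keras.models.save_model(model, "gs://your-bucket/models/my_model", save_format="tf")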

Tensorflow Serving Retrain using Inception Examples

I've been trying to set up a retrained inception model using the serving tools set up for the Inception model. I have been following the tutorial here. I managed to set up the server using the retrained model. I added the following code to the end of the retrain.py file in order to export it.
export_path = "/tmp/export"
export_version = "1"
# Export inference model.
init_op = tf.group(tf.initialize_all_tables(), name='init_op')
saver = tf.train.Saver(sharded=True)
model_exporter = exporter.Exporter(saver)
signature = exporter.classification_signature(input_tensor=jpeg_data_tensor, scores_tensor=final_tensor)
model_exporter.init(sess.graph.as_graph_def(), default_graph_signature=signature)
model_exporter.export(FLAGS.export_dir, tf.constant(export_version), sess)
print('Successfully exported model to %s' % export_path)
For the time being I only have 4 classes though. I have created the model, and using TensorFlow tools (without Serving) I managed to verify that it works on test images. Now I'm trying to serve it. I started the server on the model using the following command:
bazel-bin/tensorflow_serving/example/inception_inference --port=9000 /tmp/export/ &> retrain.log &
I get the following output, and it keeps printing the last two lines continuously.
I tensorflow_serving/core/basic_manager.cc:189] Using InlineExecutor for BasicManager.
I tensorflow_serving/example/inception_inference.cc:383] Waiting for models to be loaded...
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:147] Aspiring version for servable default from path: /tmp/exportdir/00000001
I tensorflow_serving/session_bundle/session_bundle.cc:130] Attempting to load a SessionBundle from: /tmp/exportdir/00000001
I tensorflow_serving/session_bundle/session_bundle.cc:107] Running restore op for SessionBundle
I tensorflow_serving/session_bundle/session_bundle.cc:184] Done loading SessionBundle
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:147] Aspiring version for servable default from path: /tmp/exportdir/00000001
I tensorflow_serving/example/inception_inference.cc:349] Running...
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:147] Aspiring version for servable default from path: /tmp/exportdir/00000001
I tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:147] Aspiring version for servable default from path: /tmp/exportdir/00000001
When I try to test this using the Python client for Inception, I get the following error, where it doesn't even seem to find my output:
vagrant#ubuntu:~/serving$ bazel-bin/tensorflow_serving/example/inception_client --image=/home/vagrant/AE.JPG
Traceback (most recent call last):
File "/home/vagrant/serving/bazel bin/tensorflow_serving/example/inception_client.runfiles/tf_serving/tensorflow_serving/example/inception_client.py", line 57, in <module>
tf.app.run()
File "/home/vagrant/serving/bazel-bin/tensorflow_serving/example/inception_client.runfiles/tf/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv))
File "/home/vagrant/serving/bazel-bin/tensorflow_serving/example/inception_client.runfiles/tf_serving/tensorflow_serving/example/inception_client.py", line 51, in main
result = stub.Classify(request, 5.0) # 10 secs timeout
File "/usr/local/lib/python2.7/dist-packages/grpc/framework/crust/implementations.py", line 75, in __call__
protocol_options, metadata, request)
File "/usr/local/lib/python2.7/dist-packages/grpc/framework/crust/_calls.py", line 109, in blocking_unary_unary
return next(rendezvous)
File "/usr/local/lib/python2.7/dist-packages/grpc/framework/crust/_control.py", line 415, in next
raise self._termination.abortion_error
grpc.framework.interfaces.face.face.NetworkError: NetworkError(code=StatusCode.INTERNAL, details="FetchOutputs node : not found")
Could someone guide me through this? This is my first time using Tensorflow for an actual project.