How can I use a Cloud TPU with TensorFlow Lite Model Maker?

I'm training an object detection model (EfficientDet-Lite) with TensorFlow Lite Model Maker in Colab, and I'd like to use a Cloud TPU. I have all the images in a GCS bucket and provide a CSV file. When I call object_detector.create I get the following error:
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py in shape(self)
1196 # `_tensor_shape` is declared and defined in the definition of
1197 # `EagerTensor`, in C.
-> 1198 self._tensor_shape = tensor_shape.TensorShape(self._shape_tuple())
1199 except core._NotOkStatusException as e:
1200 six.raise_from(core._status_to_exception(e.code, e.message), None)
InvalidArgumentError: Unsuccessful TensorSliceReader constructor: Failed to get matching files on /tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables: Unimplemented: File system scheme '[local]' not implemented (file: '/tmp/tfhub_modules/db7544dcac01f8894d77bea9d2ae3c41ba90574c/variables/variables')
It looks like it's trying to read local files from the Cloud TPU, which doesn't work...
The gist of what I'm doing is:
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()

train_data, validation_data, test_data = object_detector.DataLoader.from_csv(
    drive_dir + csv_name,
    images_dir="images" if not tpu else None,
    cache_dir=drive_dir + "cub_cache",
)
spec = MODEL_SPEC(tflite_max_detections=10, strategy='tpu', tpu=tpu.master(), gcp_project="xxx")
model = object_detector.create(
    train_data=train_data,
    model_spec=spec,
    validation_data=validation_data,
    epochs=epochs,
    batch_size=batch_size,
    train_whole_model=True,
)
I can't find any Model Maker example that uses a Cloud TPU.
Edit: the error seems to occur when the EfficientDet model is loaded, so Model Maker must somehow be pointing at a local file that the Cloud TPU can't read?

Yeah, the error is happening in TF Hub, and it's a well-known issue. TF Hub caches downloaded models in a local directory, which the Cloud TPU has no access to (and which Colab doesn't expose to it). Check out https://github.com/tensorflow/hub/issues/604, which should get you past this error.
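One workaround discussed in that issue is to redirect the TF Hub cache to a bucket the TPU can read. A minimal sketch, using the standard TFHUB_CACHE_DIR environment variable (the bucket name is hypothetical):

import os

# Point TF Hub's model cache at a GCS bucket instead of local /tmp so the
# TPU workers can read the cached model. Set this before anything downloads
# from TF Hub.
os.environ["TFHUB_CACHE_DIR"] = "gs://my-bucket/tfhub-cache"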

1. Download from TF Hub the model you would like to train (replace X with a digit from 0 to 4): https://tfhub.dev/tensorflow/efficientdet/liteX/feature-vector/1
2. Extract the archive twice, until you reach the "keras_metadata.pb" and "saved_model.pb" files and the "variables" folder.
3. Upload these files and folders to a Google Cloud Storage bucket.
4. Pass the uri argument to model_spec.get (https://www.tensorflow.org/lite/tutorials/model_maker_object_detection), pointing at the bucket folder (in gs:// format), as in the sketch below.
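For example, a sketch assuming the uri parameter of the EfficientDet-Lite spec classes (the bucket path is hypothetical):

from tflite_model_maker import object_detector

# Build the model spec from the copy of EfficientDet-Lite0 uploaded to your
# own bucket (hypothetical path) instead of the local TF Hub cache.
spec = object_detector.EfficientDetLite0Spec(
    uri='gs://my-bucket/efficientdet-lite0',  # folder with saved_model.pb and variables/
    tflite_max_detections=10,
    strategy='tpu',
    tpu=tpu.master(),
    gcp_project='xxx',
)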

Related

dataset from tf.data.Dataset.save/load for GCS + TPU got NotFoundError: Could not find metadata file. [Op:MakeIterator]

I tried this on both TF 2.8 and the nightly build (2.10.0-dev20220616) on Colab. In 2.8 I used the experimental version of this API. I'm doing this in the context of training a Hugging Face BERT model; the dataset has gone through several transformations to arrive at a proper tf.data dataset.
To repro:
1. In a session without a TPU, source/construct the dataset.
2. Save it with tf.data.Dataset.save(train_set, 'gs://ai-tests/data/train_set'). I tried reading it back with load and was able to iterate through it (it seems to work here).
3. In a separate TPU session, load it back: train_set = tf.data.Dataset.load('gs://ai-tests/data/train_set').
4. Iterating then raises the error:

for x in train_set:
    break
/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py
in raise_from_not_ok_status(e, name)
7207 def raise_from_not_ok_status(e, name):
7208 e.message += (" name: " + name if name is not None else "")
-> 7209 raise core._status_to_exception(e) from None # pylint: disable=protected-access
7210
7211
NotFoundError: Could not find metadata file. [Op:MakeIterator]
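For reference, a condensed version of that round trip (a sketch assuming a TF version where save/load are non-experimental, as in the nightly above; the bucket path is copied from the question):

import tensorflow as tf

# Session without a TPU: build a dataset (a stand-in for the real one) and
# save it to GCS.
train_set = tf.data.Dataset.range(10)
train_set.save('gs://ai-tests/data/train_set')

# Separate TPU session: load it back and iterate; this is where the
# NotFoundError appears.
train_set = tf.data.Dataset.load('gs://ai-tests/data/train_set')
for x in train_set:
    break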

Colab - Save model/ history after TPU training

I'm using the TPU provided by Colab and the training works well, but there are problems when I use callbacks to save the model and the training loss.
Initially, I mounted my Drive and passed the path, i.e. "/content/drive/MyDrive/records/"; no error was returned, but I then found out nothing had been saved.
An example would be like this:
csv_logger = CSVLogger("/content/drive/MyDrive/records/model_history_log_TPU.csv", append=True)
And nothing happened.
Then I tried to save them to Google Cloud Storage, passing the path as gs://my-bucket/records/, but this error is returned:
FileNotFoundError: [Errno 2] No such file or directory
So, my question is, what is the correct way to save those data?
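One pattern that typically works, as a sketch: CSVLogger writes with plain Python file I/O, which cannot open gs:// paths, so write locally first and then copy the file to GCS (the bucket path is hypothetical):

import tensorflow as tf
from tensorflow.keras.callbacks import CSVLogger

# Write the log to local disk first; CSVLogger uses plain Python file I/O,
# which can't open gs:// paths.
csv_logger = CSVLogger("/tmp/model_history_log_TPU.csv", append=True)
# ... model.fit(..., callbacks=[csv_logger]) ...

# Afterwards, copy the finished file to GCS with tf.io.gfile, which does
# understand gs:// paths.
tf.io.gfile.copy("/tmp/model_history_log_TPU.csv",
                 "gs://my-bucket/records/model_history_log_TPU.csv",
                 overwrite=True)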

What is the use of a *.pb file in TensorFlow and how does it work?

I am using an implementation for face recognition which uses this file:
"facenet.load_model("20170512-110547/20170512-110547.pb")"
What is the use of this file? I am not sure how it works.
Console log:
Model filename: 20170512-110547/20170512-110547.pb
distance = 0.72212267
GitHub link to the original owner of the code:
https://github.com/arunmandal53/facematch
pb stands for protobuf. In TensorFlow, the protobuf file contains the graph definition as well as the weights of the model. Thus, a pb file is all you need to be able to run a given trained model.
Given a pb file, you can load it as follows:
def load_pb(path_to_pb):
    with tf.gfile.GFile(path_to_pb, "rb") as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name='')
        return graph
Once you have loaded the graph, you can basically do anything. For instance, you can retrieve tensors of interest with
input = graph.get_tensor_by_name('input:0')
output = graph.get_tensor_by_name('output:0')
and run them inside a session with the regular TensorFlow routines:
with tf.Session(graph=graph) as sess:
    sess.run(output, feed_dict={input: some_data})
Explanation
The .pb format is the protocol buffer (protobuf) format, and in TensorFlow this format is used to store models. Protocol buffers are a general-purpose serialization format from Google: they are compact, efficient to transport, and enforce a structure on the data. When used in TensorFlow, the result is called a SavedModel protocol buffer, which is the default format when saving Keras/TensorFlow 2.0 models.
For example, the following code (specifically, m.save) will create a folder called my_new_model and save into it the saved_model.pb, an assets/ folder, and a variables/ folder.
# first download a SavedModel from TFHub.dev, a website with models
import tensorflow as tf
import tensorflow_hub as hub

m = tf.keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/imagenet/mobilenet_v2_130_224/classification/4")
])
m.build([None, 224, 224, 3])  # Batch input shape.
m.save("my_new_model")  # defaults to the SavedModel format in TensorFlow 2
In some places you may also see .h5 models, which was the default Keras format in TF 1.x.
Extra information: TensorFlow Lite, the library for running models on mobile and IoT devices, uses flatbuffers instead of protocol buffers; this is what the TensorFlow Lite Converter produces (the .tflite format). FlatBuffers is another very efficient Google format: it allows access to any part of the message without deserializing the whole thing (unlike JSON or XML). On devices with less memory (RAM), it makes more sense to load only what you need from the model file rather than loading the entire thing into memory to deserialize it.
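As an illustration, here is how the SavedModel saved above can be converted to a flatbuffer. A minimal sketch using tf.lite.TFLiteConverter (the paths reuse the my_new_model example):

import tensorflow as tf

# Convert the SavedModel directory into a .tflite flatbuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("my_new_model")
tflite_model = converter.convert()

with open("my_new_model.tflite", "wb") as f:
    f.write(tflite_model)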
Loading SavedModels in TensorFlow 2
I noticed BiBi's answer showing how to load a model was popular; there is a shorter way to do this in TF2:
import tensorflow as tf
model_path = "/path/to/directory/inception_v1_224_quant_20181026"
model = tf.saved_model.load(model_path)
Note:
- the directory (i.e. inception_v1_224_quant_20181026) has to contain a saved_model.pb or saved_model.pbtxt, otherwise the code will crash. You cannot specify the .pb path; specify the directory.
- you might get TypeError: 'AutoTrackable' object is not callable for older models; in that case the loaded object usually has to be called through its signatures rather than directly (see the sketch below).
- if you load a TF1 model, I found that I get no errors, but the loaded object doesn't behave as expected (e.g. it doesn't have any functions on it, like predict).
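A sketch of that signature-based call for older/TF1-style models, assuming the usual 'serving_default' signature name:

import tensorflow as tf

loaded = tf.saved_model.load("/path/to/directory/inception_v1_224_quant_20181026")
# TF1-style models often aren't directly callable; go through the signature.
infer = loaded.signatures["serving_default"]
# outputs = infer(tf.constant(batch_of_images))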

How to read a file during training in AWS SageMaker?

I'm trying to train a custom TensorFlow model using AWS SageMaker. In the model_fn method that I provide, I want to be able to read an external file. I've uploaded the file to S3 and try to read it like below:
BUCKET_PATH = 's3://<bucket_name>/data/<prefix>/'

def model_fn(features, labels, mode, params):
    # Load vocabulary
    vocab_path = os.path.join(BUCKET_PATH, 'vocab.pkl')
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)
    n_vocab = len(vocab)
    ...
I get an IOError: [Errno 2] No such file or directory
How can I read this file during training?
I don't think pickle.load can reach into an S3 bucket: Python's built-in open only handles local paths, so the s3:// URL fails. Either keep the data on the notebook's local path or download it with the boto3 client, as sketched below.
Moreover, you'd probably not want to download it in model_fn, which would be called for each epoch. Generally, data is loaded and prepared in train_input_fn.
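A minimal sketch of the boto3 route (bucket and key names follow the placeholders in the question):

import pickle
import boto3

# Download the pickle from S3 to local disk, then unpickle it from there.
s3 = boto3.client('s3')
s3.download_file('<bucket_name>', 'data/<prefix>/vocab.pkl', '/tmp/vocab.pkl')

with open('/tmp/vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)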

Use tflearn.DNN with google cloud ml-engine

Is there a good way to deploy a model built with the tflearn.DNN class to Google Cloud ML Engine? SavedModel seems to require input and output tensors to be defined in the prediction signature definition, but I'm unsure how to get those from tflearn.DNN.
I figured this out later, at least for my specific case. The snippet below lets you export your DNN as a SavedModel, which can then be deployed to Google Cloud ML Engine. Its arguments are:
- filename is the export directory
- input_tensor is the input_data layer given to tflearn.DNN
- output_tensor is the entire network passed to tflearn.DNN
- session is an attribute of the object returned by tflearn.DNN
builder = tf.saved_model.builder.SavedModelBuilder(filename)
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'in': input_tensor}, outputs={'out': output_tensor})
builder.add_meta_graph_and_variables(session,
                                     [tf.saved_model.tag_constants.SERVING],
                                     signature_def_map={'serving_default': signature})
builder.save()

serving_vars = {
    'name': self.name
}
assets = filename + '/assets.extra'
os.makedirs(assets)
with open(assets + '/serve.pkl', 'wb') as f:
    pickle.dump(serving_vars, f, pickle.HIGHEST_PROTOCOL)
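For context, a sketch of where those three values might come from in a typical tflearn model (the network itself is hypothetical):

import tflearn

input_tensor = tflearn.input_data(shape=[None, 784])   # the input_data layer
net = tflearn.fully_connected(input_tensor, 10, activation='softmax')
output_tensor = tflearn.regression(net)                # the entire network
model = tflearn.DNN(output_tensor)

session = model.session  # the tf.Session attribute of the DNN object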