How can I serve the Faster RCNN with Resnet 101 model with tensorflow serving - object-detection

I am trying to serve the Faster RCNN with Resnet 101 model with tensorflow serving.
I know I need to use tf.saved_model.builder.SavedModelBuilder to export the model definition as well as variables, then I need a script like inception_client.py provided by tensorflow_serving.
While going through the examples and documentation and experimenting, I figure someone may have already done the same thing. So please help if you have done this or know how to get it done. Thanks in advance.

Tensorflow Object Detection API has its own exporter script that is more sophisticated than the outdated examples found under Tensorflow Serving.
While building Tensorflow Serving, make sure you pull the latest master commit of tensorflow/tensorflow (>r1.2) and tensorflow/models
Build Tensorflow Serving for GPU
bazel build -c opt --config=cuda tensorflow_serving/...
If you face errors regarding crosstool and nccl, follow the solutions at
https://github.com/tensorflow/serving/issues/186#issuecomment-251152755
https://github.com/tensorflow/serving/issues/327#issuecomment-305771708
Usage
python tf_models/object_detection/export_inference_graph.py \
--pipeline_config_path=/path/to/ssd_inception_v2.config \
--trained_checkpoint_prefix=/path/to/trained/checkpoint/model.ckpt \
--output_directory /path/to/output/1 \
--export_as_saved_model \
--input_type=image_tensor
Note that during export all variables are converted into constants and baked into the protobuf binary. Don't panic if you don't find any files under the saved_model/variables directory.
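For reference, the export output directory typically looks something like this (exact file names may vary with the API version):
/path/to/output/1/
    frozen_inference_graph.pb
    model.ckpt.data-00000-of-00001
    model.ckpt.index
    model.ckpt.meta
    pipeline.config
    saved_model/
        saved_model.pb
        variables/          <- may be empty, since the variables are frozen into the graph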
To start the server,
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception_v2 --model_base_path=/path/to/output --enable_batching=true
As for the client, the examples under Tensorflow Serving work well
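For instance, a minimal gRPC client along the lines of inception_client.py might look like the sketch below. The host, port and model name follow the server command above; the 'inputs' / 'detection_*' tensor keys assume the signature produced by the object detection exporter with --input_type=image_tensor, so adjust them to your exported model.

import grpc
import numpy as np
import tensorflow as tf
from PIL import Image
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to the model server started above.
channel = grpc.insecure_channel('localhost:9000')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

# Load one image as a uint8 batch of shape [1, height, width, 3].
image = np.expand_dims(np.array(Image.open('test.jpg')), axis=0)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'inception_v2'                 # must match --model_name
request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(image))

result = stub.Predict(request, 30.0)                     # 30 second timeout
print(result.outputs['detection_scores'])
print(result.outputs['detection_boxes'])
print(result.outputs['detection_classes'])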

Related

Training using object detection api is not running on GPUs in AI Platform

I am trying to train some models with the TensorFlow 2 Object Detection API.
I am using this command:
gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 2.1 \
--python-version 3.7 \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-central1 \
--scale-tier CUSTOM \
--master-machine-type n1-highcpu-32 \
--master-accelerator count=4,type=nvidia-tesla-p100 \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The training job is submitted successfully but when I look at my submitted job on AI platform I notice that it's not using the GPUs!
Also, when looking at the logs for my training job, I noticed that in some cases it couldn't load the CUDA libraries. It would say something like this:
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
I was using AI platform for training a few months back and it was successful. I don't know what has changed now!
In fact, for my own setup, nothing has changed.
For the record, I am training Mask RCNN now. A few months back I trained Faster RCNN and SSD models.
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
I'm not sure, as I couldn't test it myself. From a quick Google search, it appears people have encountered this issue for a number of reasons, and the solution depends on the cause. The same question has already been asked on SO and you may have missed it; check it first, here.
Also, check these related issues posted below:
TensorFlow Issue #26182
TensorFlow Issue #45930
TensorFlow Issue #38578
If you've tried every possible solution and the issue still remains, update your question with the details.
I think there may be a mismatch between your CUDA setup (CUDA, cuDNN) and your TF version; check them first in your working environment. Also, make sure the CUDA path is updated properly. According to the error message, you need to ensure that the following is set correctly:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64/
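As a quick sanity check inside the training job, you can log the TF version, the visible GPUs, and (on newer TF 2.x releases) the CUDA/cuDNN versions TF was built against:

import tensorflow as tf

print("TF version:", tf.__version__)
print("Visible GPUs:", tf.config.list_physical_devices('GPU'))

# tf.sysconfig.get_build_info() only exists in newer TF 2.x releases, hence the guard.
try:
    info = tf.sysconfig.get_build_info()
    print("Built against CUDA:", info.get('cuda_version'), "cuDNN:", info.get('cudnn_version'))
except AttributeError:
    pass

If the GPU list comes back empty, the accelerators were never attached or the CUDA libraries are not on LD_LIBRARY_PATH, which would match the error above.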

Gcloud ai-platform local predict Error: gcloud crashed (PermissionError): [WinError 5] Access is denied

I was trying to run a command to test local predict in my computer. However, the command failed every time with this error.
ERROR: gcloud crashed (PermissionError): [WinError 5] Access is denied
This is the command:
gcloud ai-platform local predict --model-dir model_final --json-instances image_b64.json --framework tensorflow
I am 101% positive that I have followed everything in the doc by Google.
First, the command requires a model saved in the TensorFlow SavedModel format, which, since I use Keras, I can get by just calling model.save("model_final").
If you have used Keras for training, use tf.keras.Model.save to export a SavedModel
So I did, and it only output a single file, so I can only assume it's the file to be placed in the --model-dir parameter. I admit doing model.save("model_final") created a file, not a directory, which is a bit weird, but the Keras documentation just said to use that, so there is no way I could be wrong.
And also:
If you export your SavedModel using tf.keras.Model.save, then you do not need to specify a serving input function.
If you export a SavedModel from tf.keras or from a TensorFlow estimator, the exported graph is ready for serving by default.
The "image_b64.json" file follows this format:
{"image_bytes":{"b64": base64_jpeg_data )}}
So after 3 hours of following everything required by Google, gcloud still throws me that error. And yes, of course I ran the command line in Administrator Mode. I also tried it on two of my computers and got the same error. I am using Windows, TensorFlow 1.15.
Can anyone point out what the problem with my implementation is, or whether the Google docs/Keras are just lackluster? Thank you.
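For reference, a TensorFlow SavedModel is a directory containing saved_model.pb and a variables/ subdirectory, not a single file. A minimal export sketch for a trained Keras model, assuming tf.keras.experimental.export_saved_model behaves this way in a TF 1.15 build:

import tensorflow as tf

# A stand-in model; replace with your trained tf.keras model.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

# Writes a SavedModel directory: model_final/ will contain saved_model.pb and variables/.
tf.keras.experimental.export_saved_model(model, "model_final/")

--model-dir would then point at that directory.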

Generate SavedModel from Tensorflow model to serve it on Google Cloud ML

I used TF Hub to retrain a model for image classification. Now I would like to serve it in the cloud. For that I need a SavedModel. The retrain.py script from TF Hub uses tf.saved_model.simple_save to generate the SavedModel after the training is done.
What confuses me is that the .pb file inside the SavedModel folder I get from that method is much smaller than the final .pb saved after training.
simple_save is also now deprecated and I tried to get my SavedModel after the training is done following this SO issue.
But my variables folder is empty. How can I incorporate building the SavedModel inside retrain.py to replace the simple_save method? Tips would be much appreciated.
To deploy your model to Google Cloud ML, you need a SavedModel which can be produced from tf.saved_model api.
Below are the steps for hosting your trained models in cloud with Cloud ML Engine.
Upload your saved model to a Cloud Storage bucket. First set up a bucket: BUCKET_NAME="your_bucket_name"
Select a region for your bucket and set a REGION environment variable.
REGION=us-central1
Create a new bucket gsutil mb -l $REGION gs://$BUCKET_NAME
Upload using
SAVED_MODEL_DIR=$(ls ./your-export-dir-base | tail -1)
gsutil cp -r $SAVED_MODEL_DIR gs://your-bucket
Create a Cloud ML Engine model resource and model version.
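That last step can be done with commands along these lines (the model name and runtime version are placeholders; flags may differ slightly between gcloud releases):
gcloud ml-engine models create my_model --regions $REGION
gcloud ml-engine versions create v1 \
    --model my_model \
    --origin gs://your-bucket/$SAVED_MODEL_DIR \
    --runtime-version 1.8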
Also, for your question on incorporating the SavedModel inside retrain.py, you need to pass the saved model as an argument to the --tfhub_module flag, as below:
python retrain.py --image_dir C:\...\code\give the path here --tfhub_module C:\...\give the path to saved model directory

How to allow soft device placement when deploying a TensorFlow model to GCP?

I am trying to deploy a TensorFlow model to GCP's Cloud Machine Learning Engine for prediction, but I get the following error:
$> gcloud ml-engine versions create v1 --model $MODEL_NAME --origin $MODEL_BINARIES --runtime-version 1.9
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Invalid argument: Cannot assign a device for operation 'tartarus/dense_2/bias': Operation was explicitly assigned to /device:GPU:3 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.\n\t [[Node: tartarus/dense_2/bias = VariableV2[_class=[\"loc:#tartarus/dense_2/bias\"], _output_shapes=[[200]], container=\"\", dtype=DT_FLOAT, shape=[200], shared_name=\"\", _device=\"/device:GPU:3\"]()]]\n\n (Error code: 0)"
My model was trained on several GPUs, and it seems like the default machines on CMLE don't support GPU for prediction, hence the error I get. So, I am wondering if the following is possible:
Set the allow_soft_placement var to True, so that CMLE can use the CPU instead of the GPU for a given model.
Activate GPU prediction on CMLE for a given model.
If not, how can I deploy a TF model trained on GPUs to CMLE for prediction? It feels like this should be a straightforward feature to use, but I can't find any documentation about it.
Thanks!
I've never used gcloud ml-engine versions create, but when you deploy a training job with gcloud ml-engine jobs submit training, you can add a config flag that identifies a configuration file.
This file lets you identify the target machine for training, and you can use multiple CPUs and GPUs. The documentation for the configuration file is here.
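For example, a minimal config.yaml along these lines requests a GPU machine for the master (the machine type name is illustrative; see the linked documentation for the currently supported types):
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
It is then passed to the training job with --config config.yaml.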

TensorFlow 1.0.1 SavedModelBuilder

I'm currently exploring deploying models on Google ML Engine. At first, I developed a model using TensorFlow 1.1.0, as it was the latest version available (at the time this question was asked). However, it turned out that the highest supported version of TensorFlow on GCP is 1.0.1.
The problem is, previously when I was using TensorFlow 1.1.0, SavedModelBuilder would correctly save the model as a SavedModel with its variables under the variables/ directory. However, when I switched to TensorFlow 1.0.1, it didn't work the same way: the SavedModel file was created, but no files were created under variables/, so the model can't be loaded from the SavedModel file alone (the files under variables/ are missing).
Is this a known bug? Or should I do something to make SavedModelBuilder on TensorFlow 1.0.1 work the way it does in TensorFlow 1.1.0?
Thank you.
EDIT, more detail:
Actually, there are no explicit tf.Variables in my model. However, there are several tf.contrib.lookup.MutableDenseHashTables, and they're exported correctly in TensorFlow 1.1.0 but not in TensorFlow 1.0.1 (no variables were exported at all in 1.0.1).
It looks like the ability to save and load models in TensorFlow without variables was introduced in this commit which is only available in 1.1.0.
As a workaround, you could create a dummy (unused) variable in your model.
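A minimal sketch of that workaround (the variable name is arbitrary and the variable is never used):

import tensorflow as tf

# Unused dummy variable so SavedModelBuilder writes a non-empty variables/ directory.
_dummy = tf.Variable(0, name="dummy_for_export")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder = tf.saved_model.builder.SavedModelBuilder("/tmp/export")
    builder.add_meta_graph_and_variables(sess, [tf.saved_model.tag_constants.SERVING])
    builder.save()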
Edit:
Based on the OP's update, it sounds like there is a MutableDenseHashTable that isn't being saved out.
You can run TensorFlow 1.1 on CloudML Engine, but it requires manually adding it as an additional package.
First, download the TensorFlow 1.1 wheel. Then specify it as an additional package to your training job, e.g.,
gcloud ml-engine jobs submit training my_job \
--module-name trainer.task \
--staging-bucket gs://my-bucket \
--package-path /my/code/path/trainer \
--packages tensorflow-1.1.0-cp27-cp27mu-manylinux1_x86_64.whl