Google Cloud ML Engine: Trouble With Local Prediction Given saved_model.pb - tensorflow

I've trained a Keras model using the tf.data.Dataset API and am trying to see if I've saved it (as saved_model.pb) correctly, so I can use it on ML Engine. Here's what I've done:
estimator = tf.keras.estimator.model_to_estimator(my_model)
# create serving function...
estimator.export_savedmodel('./export', serving_fn)
So now I'm trying to use gcloud ml-engine local predict to see if I can get a prediction back. I'm doing:
gcloud ml-engine local predict --model-dir=~/path/to/folder --json-instances=instances.json
Unfortunately, I get:
cloud.ml.prediction.prediction_utils.PredictionError: Failed to load model: Cloud ML only supports TF 1.0 or above and models saved in SavedModel format. (Error code: 0)
So then I try adding --runtime-version=1.2 to my command like this:
gcloud ml-engine local predict --model-dir=~/path/to/folder --json-instances=instances.json --runtime-version=1.2
and I get back:
ERROR: (gcloud.ml-engine.local.predict) unrecognized arguments: --runtime-version=1.2
Any idea what I'm doing incorrectly / how to fix?
Thanks!

For posterity: the problem turned out to be an incorrect path. If anybody else encounters this issue, try using the full absolute path and make sure you are pointing to the directory containing the saved_model.pb file.
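For example, a minimal sketch (assuming the default timestamped export layout that export_savedmodel produces under ./export) for resolving the absolute path of the directory that actually holds saved_model.pb:
import glob
import os

# export_savedmodel writes a timestamped subdirectory under the export base;
# that subdirectory (not the base) is what --model-dir should point to.
export_base = os.path.abspath('./export')
latest_export = sorted(glob.glob(os.path.join(export_base, '*')))[-1]
print(latest_export)  # pass this absolute path to --model-dir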

Related

Gcloud ai-platform local predict Error: gcloud crashed (PermissionError): [WinError 5] Access is denied

I was trying to run a command to test local predict in my computer. However, the command failed every time with this error.
ERROR: gcloud crashed (PermissionError): [WinError 5] Access is denied
This is the command:
gcloud ai-platform local predict --model-dir model_final --json-instances image_b64.json --framework tensorflow
I am 101% positive that I have followed everything in the doc by Google.
First, the command requires a model saved in the TensorFlow SavedModel format; since I use Keras, I can just do model.save("model_final").
If you have used Keras for training, use tf.keras.Model.save to export a SavedModel
So I did, and it only output a single file, so I can only assume that's the file to point the --model-dir parameter at. I admit that model.save("model_final") created a file, not a directory, which is a bit weird, but the Keras documentation just said to use that, so I don't see how I could be wrong.
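(Note: gcloud local predict expects --model-dir to point at a SavedModel directory, i.e. a folder containing saved_model.pb and a variables/ subfolder, rather than a single HDF5 file. A hedged sketch of producing such a directory with TF 1.15 Keras, assuming tf.keras.experimental.export_saved_model is available in that release:)
import tensorflow as tf

# `model` is the trained tf.keras model; this writes a directory containing
# saved_model.pb and variables/, which is what --model-dir expects.
tf.keras.experimental.export_saved_model(model, './model_final_savedmodel/')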
And also:
If you export your SavedModel using tf.keras.Model.save, then you do not need to specify a serving input function.
If you export a SavedModel from tf.keras or from a TensorFlow estimator, the exported graph is ready for serving by default.
The "image_b64.json" file follows this format:
{"image_bytes":{"b64": base64_jpeg_data )}}
So after 3 hours of following everything Google requires, gcloud still throws that error. And yes, of course I ran the command line in Administrator mode. I also tried it on two of my computers and got the same error. I am using Windows and TensorFlow 1.15.
Can anyone point out what the problem with my implementation is, or whether the Google docs/Keras are just lackluster? Thank you.

[TF1.14][TPU] Cannot use custom TFRecord dataset on Colab using TPU

I have created a TFRecord dataset file consisting of elements and their corresponding labels. I want to use it to train a model on Colab using the free TPU. I can load the TFRecord file and even run an iterator just to see the contents; however, before the first epoch begins it throws the following error:
UnimplementedError: From /job:worker/replica:0/task:0:
File system scheme '[local]' not implemented (file: '/content/gdrive/My Drive/data/encodeddata_inGZIP.tfrecord')
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional_1]]
In my understanding, it wants the TFRecord file in a storage bucket the TPU can reach, but I don't know how to do that from Colab. How can one use a TFRecord file directly with a Colab TPU?
You need to host it on Google Cloud Storage:
All input files and the model directory must use a cloud storage bucket path (gs://bucket-name/...), and this bucket must be accessible from the TPU server. Note that all data processing and model checkpointing is performed on the TPU server, not the local machine.
As mentioned on Google's troubleshooting page: https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
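For a concrete sketch (assuming a hypothetical bucket gs://your-bucket and that the file is GZIP-compressed, as its name suggests), you could copy the file from Drive to Cloud Storage once and then read it from the gs:// path:
from google.colab import auth
import tensorflow as tf

# Authenticate the Colab runtime so it can write to your GCS bucket.
auth.authenticate_user()

# Copy the TFRecord from mounted Drive into the bucket (one-time step).
tf.io.gfile.copy('/content/gdrive/My Drive/data/encodeddata_inGZIP.tfrecord',
                 'gs://your-bucket/data/encodeddata_inGZIP.tfrecord', overwrite=True)

# Point the input pipeline at the gs:// path so the TPU workers can read it.
dataset = tf.data.TFRecordDataset(
    'gs://your-bucket/data/encodeddata_inGZIP.tfrecord', compression_type='GZIP')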
Hope this helps!

Including BEAM preprocessing graph in Keras models at serving

Short Question:
Since TensorFlow is moving towards Keras and away from Estimators, how can we incorporate our preprocessing pipelines, e.g. using tf.Transform and build_serving_input_fn() (which are used for Estimators), into our tf.keras models?
From my understanding, the only way to incorporate this preprocessing graph is to first build the model using Keras. Train it. Then export it as an estimator using tf.keras.estimator.model_to_estimator. Then create a serving_input_fn and export the estimator as a saved model, along with this serving_input_fn to be used at serving time.
To me it seems tedious, and not the correct way of doing things. Instead, I would like to go directly from Keras to Saved Model.
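For context, the Estimator route described above looks roughly like this (a hedged sketch; model, tf_transform_output and the feature name are placeholders):
import tensorflow as tf

estimator = tf.keras.estimator.model_to_estimator(keras_model=model)

def serving_input_fn():
    # Receive raw features, run the tf.Transform graph, feed the result to the model.
    raw_features = {'feature_col_name': tf.placeholder(tf.float32, [None, 1])}
    transformed = tf_transform_output.transform_raw_features(raw_features)
    return tf.estimator.export.ServingInputReceiver(transformed, raw_features)

estimator.export_savedmodel('./export', serving_input_fn)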
Problem
I would like to be able to include an Apache Beam preprocessing graph in a Keras saved model.
I would like to serve the trained Keras model, hence I export it as a SavedModel. Given a trained model, I would like to apply the following logic to predict the outcome.
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow.keras.models import load_model

# features to transform (dtype/shape must match the tf.Transform schema)
raw_features = {'feature_col_name': tf.constant([[100]])}
working_dir = 'gs://path_to_transform_fn'
# load the tf.Transform output and apply it to the raw features
tf_transform_output = tft.TFTransformOutput(working_dir)
transformed_features = tf_transform_output.transform_raw_features(raw_features)
model = load_model('model.h5')
model.predict(x=transformed_features)
When I define my model, I use Functional API and the model has the following inputs:
for i in numerical_features:
    num_inputs.append(tf.keras.layers.Input(shape=(1,), name=i))
This is the problem: tensors are not fed into Keras directly from the tf.data.Dataset; instead, they are wired in through the Input() layers.
When I export the model using tf.contrib.saved_model.save_keras_model(model=model, saved_model_path=saved_model_path), I can readily serve predictions if I handle the preprocessing in a separate script.
Is this what generally happens? For example, I would preprocess the features in some external script, and then send transformed_features to the model for prediction.
Ideally, it would all happen within the Keras model/part of a single graph. Currently it seems that I'm using the output of one graph as an input into another graph. Instead I would like to be able to use a single graph.
If using Estimators, we can build a serving_input_fn() which can be included as an argument to the estimator, which allows us to incorporate preprocessing logic into the graph.
I would also like to hear your Keras + SavedModel + preprocessing ideas for serving models on Cloud ML.
For your question on incorporating Apache Beam into the input function for a tf.transform pipeline, see this TF tutorial that explains how to do that:
"https://www.tensorflow.org/tfx/transform/get_started#apache_beam_implementation
On using TF 2.0 SavedModel with Keras, this notebook tutorial demonstrates how to do that:
https://www.tensorflow.org/beta/guide/keras/saving_and_serializing#export_to_savedmodel
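Putting the two together, one way to export a single serving graph that includes the tf.Transform preprocessing is sketched below (assuming a tensorflow_transform release that provides transform_features_layer(), and reusing the hypothetical feature name and working_dir from the question):
import tensorflow as tf
import tensorflow_transform as tft

# `model` is the trained Keras model; `working_dir` is where the Beam
# pipeline wrote the tf.Transform output.
tf_transform_output = tft.TFTransformOutput(working_dir)
model.tft_layer = tf_transform_output.transform_features_layer()

@tf.function(input_signature=[{
    'feature_col_name': tf.TensorSpec([None, 1], tf.float32, name='feature_col_name')}])
def serve_raw(raw_features):
    # Run the tf.Transform graph, then the Keras model, in one serving signature.
    transformed = model.tft_layer(raw_features)
    return {'predictions': model(transformed)}

tf.saved_model.save(model, 'gs://path/to/export',
                    signatures={'serving_default': serve_raw})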
Cloud ML is Google Cloud's managed machine learning service. It's quite simple to get started and train a model by following their documentation:
Develop and validate your training application locally
Before you run your training application in the cloud, get it running locally. Local environments provide an efficient development and validation workflow so that you can iterate quickly. You also won't incur charges for cloud resources when debugging your application locally.
Get your training data
The relevant data files, adult.data and adult.test, are hosted in a public Cloud Storage bucket. For purposes of this sample, use the versions on Cloud Storage, which have undergone some trivial cleaning, instead of the original source data. See below for more information about the data.
You can read the data files directly from Cloud Storage or copy them to your local environment. For purposes of this sample you will download the samples for local training, and later upload them to your own Cloud Storage bucket for cloud training.
Download the data to a folder:
mkdir data
gsutil -m cp gs://cloud-samples-data/ml-engine/census/data/* data/
Then, just set the TRAIN_DATA and EVAL_DATA variables to your local file paths. For example, the following commands set the variables to local paths.
TRAIN_DATA=$(pwd)/data/adult.data.csv
EVAL_DATA=$(pwd)/data/adult.test.csv
Then you have a CSV file like this:
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, <=50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, <=50K
To run it:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
--job-dir $MODEL_DIR \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--eval-steps 100
For more on training considerations, as your question mentions:
Running a Training Job
Cloud Machine Learning Engine provides model training as an asynchronous (batch) service. This page describes how to configure and submit a training job by running gcloud ml-engine jobs submit training from the command line or by sending a request to the API at projects.jobs.create.
Before you begin
Before you can submit a training job, you must package your application and upload it and any unusual dependencies to a Cloud Storage bucket. Note: If you use the gcloud command-line tool to submit your job, you can package the application and submit the job in the same step.
Configuring the job
You pass your parameters to the training service by setting the members of the Job resource, which includes the items in the TrainingInput resource.
If you use the gcloud command-line tool to submit your training jobs, you can:
Specify the most common training parameters as flags of the gcloud ml-engine jobs submit training command.
Pass the remaining parameters in a YAML configuration file, named config.yaml by convention. The configuration file mirrors the structure of the JSON representation of the Job resource. You pass the path of your configuration file in the --config flag of the gcloud ml-engine jobs submit training command. So, if the path to your configuration file is config.yaml, you must set --config=config.yaml.
Gathering the job configuration data
The following properties are used to define your job.
Job name (jobId): A name to use for the job (mixed-case letters, numbers, and underscores only, starting with a letter).
Cluster configuration (scaleTier): A scale tier specifying the type of processing cluster to run your job on. This can be the CUSTOM scale tier, in which case you also explicitly specify the number and type of machines to use.
Training application package (packageUris): A packaged training application that is staged in a Cloud Storage location. If you are using the gcloud command-line tool, the application packaging step is largely automated. See the details in the guide to packaging your application.
Module name (pythonModule): The name of the main module in your package. The main module is the Python file you call to start the application. If you use the gcloud command to submit your job, specify the main module name in the --module-name flag. See the guide to packaging your application.
Region (region): The Compute Engine region where you want your job to run. You should run your training job in the same region as the Cloud Storage bucket that stores your training data. See the available regions for Cloud ML Engine services.
Job directory (jobDir): The path to a Cloud Storage location to use for job output. Most training applications save checkpoints during training and save the trained model to a file at the end of the job. You need a Cloud Storage location to save them to. Your Google Cloud Platform project must have write access to this bucket. The training service automatically passes the path you set for the job directory to your training application as a command-line argument named job_dir. You can parse it along with your application's other arguments and use it in your code. The advantage to using the job directory is that the training service validates the directory before starting your application.
Runtime version (runtimeVersion): The Cloud ML Engine version to use for the job. If you don't specify a runtime version, the training service uses the default Cloud ML Engine runtime version 1.0.
Python version (pythonVersion): The Python version to use for the job. Python 3.5 is available with Cloud ML Engine runtime version 1.4 or greater. If you don't specify a Python version, the training service uses Python 2.7.
Formatting your configuration parameters
How you specify your configuration details depends on how you are
starting your training job:
Provide the job configuration details to the gcloud ml-engine jobs
submit training command. You can do this in two ways:
With command-line flags.
In a YAML file representing the Job resource. You can name this file whatever you want. By convention the name is config.yaml.
Even if you use a YAML file, certain details must be supplied as
command-line flags. For example, you must provide the --module-name
flag and at least one of --package-path or --packages. If you use
--package-path, you must also include --job-dir or --staging-bucket. Additionally, you must either provide the --region flag or set a
default region for your gcloud client. These options—and any others
you provide as command line flags—will override values for those
options in your configuration file.
Example 1: In this example, you choose a preconfigured machine cluster
and supply all the required details as command-line flags when
submitting the job. No configuration file is necessary. See the guide
to submitting the job in the next section.
Example 2: The following example shows the contents of the
configuration file for a job with a custom processing cluster. The
configuration file includes some but not all of the configuration
details, assuming that you supply the other required details as
command-line flags when submitting the job.
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_m
  workerType: complex_model_m
  parameterServerType: large_model
  workerCount: 9
  parameterServerCount: 3
  runtimeVersion: '1.13'
  pythonVersion: '3.5'
The above example specifies Python version 3.5, which is available
when you use Cloud ML Engine runtime version 1.4 or greater.
Submitting the job
When submitting a training job, you specify two sets of flags:
Job configuration parameters. Cloud ML Engine needs these values to set up resources in the cloud and deploy your application on each
node in the processing cluster.
User arguments, or application parameters. Cloud ML Engine passes the value of these flags through to your application.
Submit a training job using the gcloud ml-engine jobs submit training
command.
First, it's useful to define some environment variables containing
your configuration details. To create a job name, the following code
appends the date and time to the model name:
TRAINER_PACKAGE_PATH="/path/to/your/application/sources"
now=$(date +"%Y%m%d_%H%M%S")
JOB_NAME="your_name_$now"
MAIN_TRAINER_MODULE="trainer.task"
JOB_DIR="gs://your/chosen/job/output/path"
PACKAGE_STAGING_PATH="gs://your/chosen/staging/path"
REGION="us-east1"
RUNTIME_VERSION="1.13"
The following job submission corresponds to configuration example 1
above, where you choose a preconfigured scale tier (basic) and you
decide to supply all the configuration details via command-line flags.
There is no need for a config.yaml file:
gcloud ml-engine jobs submit training $JOB_NAME \
--scale-tier basic \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--job-dir $JOB_DIR \
--region $REGION \
-- \
--user_first_arg=first_arg_value \
--user_second_arg=second_arg_value
The following job submission corresponds to configuration example 2
above, where some of the configuration is in the file and you supply
the other details via command-line flags:
gcloud ml-engine jobs submit training $JOB_NAME \
--package-path $TRAINER_PACKAGE_PATH \
--module-name $MAIN_TRAINER_MODULE \
--job-dir $JOB_DIR \
--region $REGION \
--config config.yaml \
-- \
--user_first_arg=first_arg_value \
--user_second_arg=second_arg_value
Notes:
If you specify an option both in your configuration file (config.yaml) and as a command-line flag, the value on the command
line overrides the value in the configuration file.
The empty -- flag marks the end of the gcloud specific flags and the start of the USER_ARGS that you want to pass to your application.
Flags specific to Cloud ML Engine, such as --module-name, --runtime-version, and --job-dir, must come before the empty -- flag. The Cloud ML Engine service interprets these flags.
The --job-dir flag, if specified, must come before the empty -- flag, because Cloud ML Engine uses the --job-dir to validate the path.
Your application must handle the --job-dir flag too, if specified. Even though the flag comes before the empty --, the --job-dir is also
passed to your application as a command-line flag.
You can define as many USER_ARGS as you need. Cloud ML Engine passes --user_first_arg, --user_second_arg, and so on, through to your
application.
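To make the last two notes concrete, a minimal sketch of how trainer/task.py might parse --job-dir together with the user arguments forwarded after the empty -- (the argument names are just the ones from the example submissions above):
import argparse

parser = argparse.ArgumentParser()
# Passed by Cloud ML Engine because --job-dir was given before the empty `--`.
parser.add_argument('--job-dir', type=str, default='')
# User arguments forwarded verbatim after the empty `--`.
parser.add_argument('--user_first_arg', type=str, default='')
parser.add_argument('--user_second_arg', type=str, default='')
args = parser.parse_args()
print(args.job_dir, args.user_first_arg, args.user_second_arg)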

Generate SavedModel from Tensorflow model to serve it on Google Cloud ML

I used TF Hub to retrain a model for image classification. Now I would like to serve it in the cloud. For that I need a SavedModel. The retrain.py script from TF Hub uses tf.saved_model.simple_save to generate the SavedModel after the training is done.
What confuses me is that the .pb file inside the SavedModel folder I get from that method is much smaller than the final .pb saved after training.
simple_save is also now deprecated and I tried to get my SavedModel after the training is done following this SO issue.
But my variables folder is empty. How can I incorporate that building of the SavedModel inside retrain.py to replace the simple_save method? Tips would be much appreciated.
To deploy your model to Google Cloud ML, you need a SavedModel which can be produced from tf.saved_model api.
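For example, inside retrain.py's export step you could build the SavedModel explicitly instead of calling simple_save (a hedged sketch; sess, in_image and final_tensor stand for the session and tensors retrain.py already has in scope, and saved_model_dir is your export path):
import tensorflow as tf

# Write saved_model.pb plus a populated variables/ folder from the live session.
builder = tf.saved_model.builder.SavedModelBuilder(saved_model_dir)
signature = tf.saved_model.signature_def_utils.predict_signature_def(
    inputs={'image': in_image}, outputs={'prediction': final_tensor})
builder.add_meta_graph_and_variables(
    sess, [tf.saved_model.tag_constants.SERVING],
    signature_def_map={
        tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature})
builder.save()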
Below are the steps for hosting your trained models in cloud with Cloud ML Engine.
Upload your saved model to a Cloud Storage bucket. First set a name for the bucket:
BUCKET_NAME="your_bucket_name"
Select a region for your bucket and set a REGION environment variable:
REGION=us-central1
Create a new bucket:
gsutil mb -l $REGION gs://$BUCKET_NAME
Upload using
SAVED_MODEL_DIR=$(ls ./your-export-dir-base | tail -1)
gsutil cp -r $SAVED_MODEL_DIR gs://your-bucket
Create a Cloud ML Engine model resource and model version.
Also, for your question on incorporating the SavedModel inside retrain.py, you need to pass the saved model as an argument to the --tfhub_module flag, as below:
python retrain.py --image_dir <path to your image directory> --tfhub_module <path to the saved model directory>

How to allow soft device placement when deploying a TensorFlow model to GCP?

I am trying to deploy a TensorFlow model to GCP's Cloud Machine Learning Engine for prediction, but I get the following error:
$> gcloud ml-engine versions create v1 --model $MODEL_NAME --origin $MODEL_BINARIES --runtime-version 1.9
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Invalid argument: Cannot assign a device for operation 'tartarus/dense_2/bias': Operation was explicitly assigned to /device:GPU:3 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.\n\t [[Node: tartarus/dense_2/bias = VariableV2[_class=[\"loc:#tartarus/dense_2/bias\"], _output_shapes=[[200]], container=\"\", dtype=DT_FLOAT, shape=[200], shared_name=\"\", _device=\"/device:GPU:3\"]()]]\n\n (Error code: 0)"
My model was trained on several GPUs, and it seems like the default machines on CMLE don't support GPU for prediction, hence the error I get. So, I am wondering if the following is possible:
Set the allow_soft_placement var to True, so that CMLE can use the CPU instead of the GPU for a given model.
Activate GPU prediction on CMLE for a given model.
If not, how can I deploy a TF model trained on GPUs to CMLE for prediction? It feels like this should be a straightforward feature to use, but I can't find any documentation about it.
Thanks!
I've never used gcloud ml-engine versions create, but when you deploy a training job with gcloud ml-engine jobs submit training, you can add a config flag that identifies a configuration file.
This file lets you identify the target machine for training, and you can use multiple CPUs and GPUs. The documentation for the configuration file is here.