When I run the following in a terminal:
gcloud ml-engine local train --module-name trainer.task --package-path trainer/ --job-dir $MODEL_DIR
It runs successfully, but nothing appears in the output folder, although according to the getting-started guide I should see some files and checkpoints: https://cloud.google.com/ml-engine/docs/tensorflow/getting-started-training-prediction
In the code I've got this line to save my model:
save_path = saver.save(sess, "./my_mnist_model.ckpt")
That generates the following files in the current working directory: my_mnist_model.ckpt.index, my_mnist_model.ckpt.meta, my_mnist_model.ckpt.data-00000-of-00001
However, they are not in the output folder. When I run it on Cloud Machine Learning Engine, nothing appears in the specified output folder in my bucket either.
So the model trains successfully but is not saved anywhere.
What am I missing in my code / gcloud command?
I just figured out that I need to handle --job-dir myself in the script. From the getting-started guide I had assumed it was handled by the gcloud command that runs the training.
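For anyone hitting the same thing, here is a minimal sketch of what handling --job-dir in the script can look like. The helper names parse_job_dir and checkpoint_path are made up for illustration, and the sketch assumes the trainer receives --job-dir as an ordinary command-line argument and that the saver accepts a gs:// prefix; adapt it to your own task.py.

```python
import argparse


def parse_job_dir(argv=None):
    """Read --job-dir from the command line (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        default=".",
        help="Output directory: a local path or a gs:// URI.",
    )
    # parse_known_args ignores any other flags the trainer may receive.
    args, _unknown = parser.parse_known_args(argv)
    return args.job_dir


def checkpoint_path(job_dir, filename="my_mnist_model.ckpt"):
    """Build the checkpoint prefix inside the job directory."""
    # gs:// URIs always use forward slashes, so join by hand
    # instead of relying on the platform's path separator.
    return job_dir.rstrip("/") + "/" + filename


if __name__ == "__main__":
    job_dir = parse_job_dir()
    # In the training script this would replace the hard-coded path:
    #   save_path = saver.save(sess, checkpoint_path(job_dir))
    print(checkpoint_path(job_dir))
```

With this in place, the checkpoint files should land under whatever directory is passed as --job-dir instead of the current working directory.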
In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on line 2 (the original had a JOB_DIR starting with gs://gs://, which threw an error):
# Original:
# Correction:
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
# Specify the Docker container for your built-in algorithm selection.
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
# Make sure you have access to this Cloud Storage bucket.
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error:
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command line argument:
With this one:
--master-image-uri $IMAGE_URI
I want to profile my model. This is a tutorial on how to do it: https://towardsdatascience.com/howto-profile-tensorflow-1a49fb18073d. But I would like to use the TensorFlow profiler, as shown in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md#quick-start. According to this post, the following code should start the profiler:
# When using high-level API, session is usually hidden.
# Under the default ProfileContext, run a few hundred steps.
# The ProfileContext will sample some steps and dump the profiles
# to files. Users can then use command line tool or Web UI for
# interactive profiling.
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
# High level API, such as slim, Estimator, etc.
bazel-bin/tensorflow/core/profiler/profiler \
tfprof> op -select micros,bytes,occurrence -order_by micros
# To be open sourced...
bazel-bin/tensorflow/python/profiler/profiler_ui \
I generated the file profile_100 and located the directory profiler. So this is what I typed in my terminal:
bazel-/Users/mencia/anaconda/envs/tensorflow_py36/lib/python3.6/site-packages/tensorflow/profiler \
This raised the following error:
bazel-/Users/mencia/anaconda/envs/tensorflow_py36/lib/python3.6/site-packages/tensorflow/profiler: No such file or directory
My directory profiler contains:
But according to the above code, there should be
Which I don't have.
What do I do to start the Profiler?
You have to build the profiler first. Clone the TensorFlow repo (git clone https://github.com/tensorflow/tensorflow.git) and run bazel build --config opt tensorflow/core/profiler:profiler in the root directory.
I was following the link below to replicate the process with new data and new model:
When I reached the last step, I started the training job with the script below:
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
It seems the job is kicking off successfully:
Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command
$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx
or continue streaming the logs with the command
However, it stops due to the following errors in the log:
Since I am extremely new to Google Cloud ML and the TensorFlow Object Detection API, I couldn't find a clue in the log regarding which step I got wrong.
The YML cluster configuration file I was using is:
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
I would really appreciate it if anyone could at least point me in a direction for debugging. Thanks so much in advance!
---------------- Update on the question --------------
I have actually got it working by changing the setup.py as below:
"""Setup script for object_detection."""
from setuptools import find_packages
from setuptools import setup
# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0','Pillow>=1.0','Matplotlib>=2.1','Cython>=0.28.1','Jupyter']
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
I did run into some "no module found" issues when running the training job, but there are plenty of online discussions that quickly identify the solution, so I am not repeating them here.
However, I did get stuck on an issue when running the evaluation job, "cannot import pycocotool", for which I found the solution here: https://github.com/tensorflow/models/issues/3470
Now both my training and evaluation jobs are up and running. However, it seems strange that no statistics (e.g. the loss plot in orange) show up for the evaluation job in TensorBoard's scalar display (although the eval job checkbox does show up as a view option):
I have also checked the eval job's log and found that the node seems to be constantly skipping images. Is this the cause of the issue? Could there be a problem with the evaluation dataset?
Log info in eval job:
Parallel interleave functionality is available only in TensorFlow 1.5+. Try changing the line in your YAML to:
runtimeVersion: "1.8"
I was following this tutorial to use tensorflow serving using my object detection model. I am using tensorflow object detection for generating the model. I have created a frozen model using this exporter (the generated frozen model works using python script).
The frozen graph directory has the following contents (the variables directory is empty):
Now when I try to serve the model using the following command,
tensorflow_model_server --port=9000 --model_name=ssd --model_base_path=/serving/ssd_frozen/
It always shows me
tensorflow_serving/model_servers/server_core.cc:421] (Re-)adding
model: ssd 2017-08-07 10:22:43.892834: W
No versions of servable ssd found under base path /serving/ssd_frozen/
2017-08-07 10:22:44.892901: W
No versions of servable ssd found under base path /serving/ssd_frozen/
I had the same problem. The cause is that the Object Detection API does not assign a version to your model when exporting it, whereas TensorFlow Serving requires a version number so that you can choose which version of a model to serve. In your case, you should put your detection model (the .pb file and the variables folder) under the folder:
/serving/ssd_frozen/1/. This assigns version 1 to your model, and TensorFlow Serving will load it automatically since it is the only version. By default, TensorFlow Serving serves the latest version (i.e., the one with the highest version number).
Note that after you create the 1/ folder, model_base_path must still be set to --model_base_path=/serving/ssd_frozen/.
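If you would rather script the move than do it by hand, something like the following should work. It is only a sketch: the function name move_into_version_dir is made up, and it assumes everything currently sitting directly in the base path (for example saved_model.pb and variables/) belongs to a single exported model.

```python
import os
import shutil


def move_into_version_dir(model_base_path, version=1):
    """Move an exported model into <model_base_path>/<version>/ (hypothetical helper)."""
    version_dir = os.path.join(model_base_path, str(version))
    os.makedirs(version_dir, exist_ok=True)
    for name in os.listdir(model_base_path):
        src = os.path.join(model_base_path, name)
        # Leave numeric directories alone: they are version folders,
        # including the one that was just created.
        if os.path.isdir(src) and name.isdigit():
            continue
        shutil.move(src, os.path.join(version_dir, name))
    return version_dir


if __name__ == "__main__":
    # Path taken from the question; adjust it to your own setup.
    print(move_into_version_dir("/serving/ssd_frozen/", version=1))
```

The server is still started with the base path (the parent of the numbered folder), not the version folder itself.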
Newer versions of TF Serving no longer support the model format exported by SessionBundle; they expect the format produced by SavedModelBuilder instead.
I suppose it's better to restore a session from your older model format and then export it with SavedModelBuilder, which also lets you specify the version of your model.
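import os
import tensorflow as tf  # TF 1.x API

def export_saved_model(version, path, sess=None):
    tf.app.flags.DEFINE_integer('version', version, 'version number of the model.')
    tf.app.flags.DEFINE_string('work_dir', path, 'your older model directory.')
    tf.app.flags.DEFINE_string('model_dir', '/tmp/model_name', 'saved model directory')
    FLAGS = tf.app.flags.FLAGS

    # You can pass in the session and export your model immediately after training.
    if not sess:
        # Otherwise restore the latest checkpoint into a new session.
        sess = tf.Session()
        saver = tf.train.import_meta_graph(os.path.join(path, 'xxx.ckpt.meta'))
        saver.restore(sess, tf.train.latest_checkpoint(path))

    # Assumed reconstruction: the version number becomes the leaf directory.
    export_path = os.path.join(FLAGS.model_dir, str(FLAGS.version))
    builder = tf.saved_model.builder.SavedModelBuilder(export_path)

    # define the signature def map here
    # ...
    signature_def_map = {}  # placeholder, fill in your own signatures

    legacy_init_op = tf.group(tf.tables_initializer(), name='legacy_init_op')
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING],
        signature_def_map=signature_def_map,
        legacy_init_op=legacy_init_op)
    builder.save()
    print('Export SavedModel!')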
You can find the main part of the code above in the TF Serving example.
In the end it generates a SavedModel in a format that can be served.
Create a version folder under the model directory, e.g. serving/model_name/0000123/saved_model.pb
The answers above already explain why it is important to keep a version number inside the model folder. The link below has several sets of built models that you can use as a reference.
I was doing this on my personal computer running Ubuntu, not Docker. Note I am in a directory called "serving". This is where I saved my folder "mobile_weight". I had to create a new folder, "0000123" inside "mobile_weight". My path looks like serving->mobile_weight->0000123->(variables folder and saved_model.pb)
The command from the tensorflow serving tutorial should look like (Change model_name and your directory):
nohup tensorflow_model_server \
--rest_api_port=8501 \
--model_name=model_weight \
--model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1
So my entire terminal screen looks like:
murage#murage-HP-Spectre-x360-Convertible:~/Desktop/serving$ nohup tensorflow_model_server --rest_api_port=8501 --model_name=model_weight --model_base_path=/home/murage/Desktop/serving/mobile_weight >server.log 2>&1
That error message can also be caused by problems with the --volume argument.
Make sure your --volume mount is correct and points to the model's directory. This is essentially a generic 'model not found' error, even though the message makes it look more complicated.
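A quick way to sanity-check the directory you are about to mount is to list its numeric version subfolders before starting the server. The helper below is a hypothetical sketch (find_versions is not part of TensorFlow Serving); it only mirrors the layout rule described above, where each version lives in a subfolder whose name is a number.

```python
import os


def find_versions(model_base_path):
    """Return the numeric version subfolders under model_base_path, ascending."""
    if not os.path.isdir(model_base_path):
        # A wrong or unmounted path looks the same as "no versions".
        return []
    versions = [
        int(name)
        for name in os.listdir(model_base_path)
        if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))
    ]
    return sorted(versions)


if __name__ == "__main__":
    # Path taken from an earlier answer; replace it with the host
    # directory you pass to --volume.
    found = find_versions("/home/murage/Desktop/serving/mobile_weight")
    print(found if found else "no version folders found, check the path")
```

An empty result on the host side points at the folder layout; a non-empty result combined with the server error points at the mount itself.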
On Windows, just use cmd; otherwise it is easy to accidentally use Linux file paths and Linux separators in Cygwin or Git Bash. Even with the correct file structure, you can get the OP's error if you don't use the Windows absolute path.
#using cygwin
$ echo $TESTDATA
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA/saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 20:12:28.995834: W tensorflow_serving/sources/storage_path/file_system_storage_path_source.cc:267] No versions of servable half_plus_two found under base path /models/half_plus_two. Did you forget to name your leaf directory as a number (eg. '/1/')?
Calling the same command with the same, unchanged file structure, but with the full Windows path and Windows file separators, works:
#using cygwin
$ export TESTDATA="$(cygpath -w "/home/username/directory/serving/tensorflow_serving/servables/tensorflow/testdata")"
$ echo $TESTDATA
$ docker run -t --rm -p 8501:8501 -v "$TESTDATA\\saved_model_half_plus_two_cpu:/models/half_plus_two" -e MODEL_NAME=half_plus_two tensorflow/serving
2021-01-22 21:10:49.527049: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: half_plus_two version: 1}
I'm trying to train a model using TensorFlow on Google Cloud ML Engine. It seems that TensorFlow can't find the libcupti files on the cloud compute machine because LD_LIBRARY_PATH does not point to the correct directory, as implied by the log entry below:
lineno: 126
message: "Couldn't open CUDA library libcupti.so.8.0.
LD_LIBRARY_PATH: /usr/local/cuda/lib64"
levelname: "INFO"
pathname: "tensorflow/stream_executor/dso_loader.cc"
created: 1491143889.84344
As far as I know, the libcupti files are all in /usr/local/cuda/extras/CUPTI/lib64, so I would need to append this to the LD_LIBRARY_PATH variable, but how would I do that when submitting a job via a gcloud ml-engine jobs submit training $JOB_NAME command? Or maybe there's an easier solution?
I tried using a GPU with TensorFlow on Google Cloud and it works for me. My code has no GPU-specific settings (and does not set LD_LIBRARY_PATH).
I think you can just use simple, standard TensorFlow code and attach a config when you submit the job; the job should then automatically use the GPU for the computation.
Try adding a file such as cloudml-gpu.yaml to your module with the following content:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.0"
Then add the option --config=trainer/cloudml-gpu.yaml (assuming your training module is in a folder called trainer). For example:
export BUCKET_NAME=tf-learn-simple-sentiment
export JOB_NAME="example_5_train_$(date +%Y%m%d_%H%M%S)"
export REGION=europe-west1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--runtime-version 1.0 \
--module-name trainer.example5-keras \
--package-path ./trainer \
--region $REGION \
--config=trainer/cloudml-gpu.yaml \
-- \
--train-file gs://tf-learn-simple-sentiment/sentiment_set.pickle