Error when submitting training job to gcloud - tensorflow

I am new to training on Google Cloud.
When I run the training job, I get the following error:
(gcloud.ml-engine.jobs.submit.training) Could not copy [research/dist/object_detection-0.1.tar.gz] to [training/packages/c5292b23e57f357dc2d63baab473c04337dbadd2deeb10965e743cd8422b964f/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I am using this command to run the training job:
gcloud ml-engine jobs submit training job1 \
--job-dir=gs://${ml-project-neu}/training \
--packages research/dist/object_detection-0.1.tar.gz,research/slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--config cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${ml-project-neu}/training \
--pipeline_config_path=gs://${ml-project-neu}/data/faster_rcnn_inception_v2_pets.config

Make sure ${ml-project-neu} is valid (it may be the empty string in your case); make sure gs://${ml-project-neu} exists; and make sure the credentials you are using with gcloud have access to your GCS bucket (consider running gcloud auth login).
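One thing worth checking first: hyphens are not valid in shell variable names, so bash parses ${ml-project-neu} as "the value of $ml, falling back to the literal string project-neu if $ml is unset", not as a variable named ml-project-neu. A quick diagnostic sketch (the bucket name below is a placeholder):
echo "gs://${ml-project-neu}/training"   # show what the destination actually expands to
gsutil ls -b gs://my-actual-bucket       # confirm the bucket exists and you can read it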

Related

In Google Cloud's "built-in image object detection algorithm" documentation, what is config.yaml?

In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
...
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on line 2 (the original had a JOB_DIR starting with gs://gs://, which threw an error):
PROJECT_ID="myapp"
# Original:
#BUCKET_NAME="gs://mybucket/"
# Correction:
BUCKET_NAME="mybucket"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/val*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"
DATASET_NAME="coco"
ALGORITHM="object_detection"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"
# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
--aug_scale_max=1.2
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error instead (an empty file parses as YAML null, so gcloud has nothing to read fields from):
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
https://cloud.google.com/ml/reference/rest/v1/projects.jobs
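So config.yaml is expected to hold a Job resource written as YAML. A minimal sketch, assuming all you want to set is the scale tier (both field names come from the Job resource linked above; the tier value is just an example):
trainingInput:
  scaleTier: BASIC_GPU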
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command-line argument:
--config=config.yaml
With this one:
--master-image-uri $IMAGE_URI
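The same setting can presumably also be expressed inside config.yaml through the Job resource's trainingInput.masterConfig.imageUri field, which would make --config work as documented; this is a sketch based on the API reference, not a tested configuration:
trainingInput:
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest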

Error when doing the TensorFlow tutorial

I followed this tutorial to do the training on Google Cloud ml-engine. I followed it step by step, but I ran into an error when submitting the ML job to the cloud. I ran this command:
sam@sam-VirtualBox:~/models/research$ gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--job-dir=gs://tf_testing/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
-- \
--train_dir=gs://tf_testing/train \
--pipeline_config_path=gs://tf_testing/data/faster_rcnn_resnet101_pets.config
and I got this error.
ERROR: (gcloud.ml-engine.jobs.submit.training) FAILED_PRECONDITION: Field: package_uris Error: The provided GCS paths [gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/slim-0.1.tar.gz, gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/object_detection-0.1.tar.gz] cannot be read by service account service-499049193648@cloud-ml.google.com.iam.gserviceaccount.com. - '@type': type.googleapis.com/google.rpc.BadRequest fieldViolations: - description: The provided GCS paths [gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/slim-0.1.tar.gz, gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/object_detection-0.1.tar.gz] cannot be read by service account service-499049193648@cloud-ml.google.com.iam.gserviceaccount.com. field: package_uris
I saw this post and this post and tried the solutions, but they did not help. FYI, I did not change PATH_TO_BE_CONFIGURED when I ran this command. Could that be the reason?
sed -i "s|PATH_TO_BE_CONFIGURED|"gs://${YOUR_GCS_BUCKET}"/data|g" \
object_detection/samples/configs/faster_rcnn_resnet101_pets.config
You need to allow the service account to read/write to your bucket:
# Grant the Cloud ML service account write access to the bucket
gsutil acl ch -u $SVCACCT:WRITE gs://$BUCKET/
# Make it the default owner (O) of objects subsequently created in the bucket
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/
Alternatively:
gcloud ml-engine init-project
This will add the service account as an editor on the project. Make sure to do this in the project that owns the bucket.
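The failing service account is printed in the error message itself, but it can also be derived from the project number if you need it for the gsutil commands above; a sketch assuming $PROJECT_ID is set to the project that owns the bucket:
# Look up the project number and build the Cloud ML service account name
PROJECT_NUMBER=$(gcloud projects describe "$PROJECT_ID" --format='value(projectNumber)')
SVCACCT="service-${PROJECT_NUMBER}@cloud-ml.google.com.iam.gserviceaccount.com"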

RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`

Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
The expected and actual RunConfig printouts are identical except for the memory address of the _tf_config ConfigProto, so it appears two separate RunConfig instances are being constructed. I have not been able to find documentation on how to set this up. Thanks.
UPDATE:
It appears to have something to do with my virtualenv setup. It works fine when I install TensorFlow natively.
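If keeping the virtualenv matters, one workaround worth trying (untested here; ml_engine/local_python is the gcloud property that selects the interpreter used for local training) is pointing gcloud at the virtualenv's Python, so that gcloud and your code import the same TensorFlow:
# With the virtualenv activated, make gcloud ml-engine local train use its interpreter
gcloud config set ml_engine/local_python "$(which python)"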

rnn translate showing data_utils not found in google-cloud-ml-engine

I want to create a chatbot using TensorFlow. I am using the code in 'github.com/tensorflow/models/tree/master/tutorials/rnn/translate'. While running the code on google-cloud-ml-engine I get the exception '/usr/bin/python: No module named data_utils' and the job fails.
Here are the commands I used:
gcloud ml-engine jobs submit training ${JOB_NAME} \
--package-path=. \
--module-name=translate.translate \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--from_train_data=${INPUT_TRAIN_DATA_A} \
--to_train_data=${INPUT_TRAIN_DATA_B} \
--from_dev_data=${INPUT_TEST_DATA_A} \
--to_dev_data=${INPUT_TEST_DATA_B} \
--train_dir="${TRAIN_PATH}" \
--data_dir="${TRAIN_PATH}" \
--steps_per_checkpoint=5 \
--from_vocab_size=45000 \
--to_vocab_size=45000
ml_engine log screenshots 1 and 2
Is this a problem with ml-engine or with TensorFlow?
I followed the blog 'blog.kovalevskyi.com/how-to-train-a-chatbot-with-the-tensorflow-and-google-cloud-ml-3a5617289032' and initially used 'github.com/b0noI/models/tree/translate_tutorial_supports_google_cloud_ml/tutorials/rnn/translate'. It gave the same error.
Neither; it is actually a problem within the code you are uploading, namely satisfying local dependencies. The file data_utils.py is located in the same folder you got the example from, and as mentioned in this post, you should make sure it is available to your module.
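One way to satisfy that dependency (a sketch assuming the standard ml-engine packaging layout; file names are taken from the translate tutorial) is to keep data_utils.py inside the package you submit:
translate/
    __init__.py       # marks the folder as a package so --module-name=translate.translate resolves
    translate.py      # entry point; imports data_utils
    data_utils.py     # copied from tutorials/rnn/translate
    seq2seq_model.py  # also imports data_utils
gcloud tars up everything under --package-path, so once data_utils.py sits in that tree it is uploaded along with the job.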

Cannot resubmit job to ml-engine because "A job with this id already exists"

I am trying to submit a job to gcloud ml-engine. For reference, the job uses this sample provided by Google.
It went through the first time, albeit with errors unrelated to this question, and now I am trying to reissue the command after having corrected my errors:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--runtime-version 1.0 \
--job-dir $GCS_JOB_DIR \
--module-name trainer.task \
--package-path trainer/ \
--region us-east1 \
-- \
--train-files $TRAIN_GCS_FILE \
--eval-files $EVAL_GCS_FILE \
--train-steps $TRAIN_STEPS
where $JOB_NAME = census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, etc. for every new job.
The following is the error I receive:
ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.
Is it part of the design that the same job name cannot be resubmitted, or am I missing something?
Like Chunck just said, simply try setting JOB_NAME as:
JOB_NAME="census_$(date +%Y%m%d_%H%M%S)"
Not sure if this will help, but in Google's sample code for flowers, the error is avoided by appending the date and time to the job id, as shown on line 22, e.g.,
declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
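Job IDs in ML Engine are indeed unique per project and cannot be reused, which is why the date/time suffix is the usual pattern. To see which IDs are already taken before picking a new one:
gcloud ml-engine jobs list              # all jobs in the current project
gcloud ml-engine jobs describe census   # state of the conflicting job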