rnn translate showing data_utils not found in google-cloud-ml-engine - tensorflow

I want to create a chatbot using TensorFlow. I am using the code from 'github.com/tensorflow/models/tree/master/tutorials/rnn/translate'. While running the code on google-cloud-ml-engine I get the exception '/usr/bin/python: No module named data_utils' and the job fails.
Here are the commands I used:
gcloud ml-engine jobs submit training ${JOB_NAME} \
--package-path=. \
--module-name=translate.translate \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--from_train_data=${INPUT_TRAIN_DATA_A} \
--to_train_data=${INPUT_TRAIN_DATA_B} \
--from_dev_data=${INPUT_TEST_DATA_A} \
--to_dev_data=${INPUT_TEST_DATA_B} \
--train_dir="${TRAIN_PATH}" \
--data_dir="${TRAIN_PATH}" \
--steps_per_checkpoint=5 \
--from_vocab_size=45000 \
--to_vocab_size=45000
[ml-engine log screenshots 1 and 2]
Is this a problem with ml-engine or with TensorFlow?
I followed the blog 'blog.kovalevskyi.com/how-to-train-a-chatbot-with-the-tensorflow-and-google-cloud-ml-3a5617289032' and initially used 'github.com/b0noI/models/tree/translate_tutorial_supports_google_cloud_ml/tutorials/rnn/translate', which gave the same error.

Neither; it is actually a problem within the code you are uploading,
namely an unsatisfied local dependency. The file data_utils.py is located in the same folder you got the example from. As also mentioned in this post, you should make sure it is available to your model.
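For example, a quick sanity check before submitting (the file names below follow the tutorial's layout; adjust if yours differ) is to confirm that data_utils.py sits inside the package directory being uploaded:
ls translate/
# Expected contents, including the local dependency:
#   __init__.py  translate.py  seq2seq_model.py  data_utils.py
# If data_utils.py lives elsewhere, copy it into the package and make sure
# the directory is importable as a package:
cp ../data_utils.py translate/
touch translate/__init__.py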

Related

In Google Cloud's "built-in image object detection algorithm" documentation, what is config.yaml?

In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
...
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on line 2 (the original had a JOB_DIR starting with gs://gs://, which threw an error):
PROJECT_ID="myapp"
# Original:
#BUCKET_NAME="gs://mybucket/"
# Correction:
BUCKET_NAME="mybucket"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/val*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"
DATASET_NAME="coco"
ALGORITHM="object_detection"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"
# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
--aug_scale_max=1.2
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error:
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
https://cloud.google.com/ml/reference/rest/v1/projects.jobs
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command line argument:
--config=config.yaml
With this one:
--master-image-uri $IMAGE_URI
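If you would rather keep a config file, the Job resource linked in the gcloud documentation above does appear to accept the image URI in YAML form. An untested sketch (the scale tier here is an assumption):
cat > config.yaml <<'EOF'
trainingInput:
  scaleTier: BASIC_TPU
  masterConfig:
    imageUri: gcr.io/cloud-ml-algos/image_object_detection:latest
EOF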

Error when submitting training job to gcloud

I am new to training on Google Cloud.
When I am running the training job, I get the following error:
(gcloud.ml-engine.jobs.submit.training) Could not copy [research/dist/object_detection-0.1.tar.gz] to [training/packages/c5292b23e57f357dc2d63baab473c04337dbadd2deeb10965e743cd8422b964f/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I am using this to run the training job:
gcloud ml-engine jobs submit training job1 \
--job-dir=gs://${ml-project-neu}/training \
--packages research/dist/object_detection-0.1.tar.gz,research/slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--config cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${ml-project-neu}/training \
--pipeline_config_path=gs://${ml-project-neu}/data/faster_rcnn_inception_v2_pets.config
Make sure ${ml-project-neu} expands to what you expect; hyphens are not valid in shell variable names, so the shell parses this as $ml with a fallback (it may be the empty string in your case). Make sure gs://${ml-project-neu} exists. And make sure the credentials you are using with gcloud have access to your GCS bucket (consider running gcloud auth login).
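A quick way to verify all three (a sketch; the bucket name is a placeholder):
BUCKET="gs://my-ml-project-bucket"   # substitute your real bucket
gsutil ls "${BUCKET}" || echo "bucket is missing or this account lacks access"
gcloud auth list                     # confirm which account gcloud is using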

RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`

Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
So it appears the two RunConfig instances differ only in the memory address of the _tf_config ConfigProto object. I have not been able to find documentation on how to set that up. Thanks.
UPDATE:
This appears to have something to do with my virtualenv setup; it works fine when I install TensorFlow natively.
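In case it helps someone else, rebuilding the virtualenv from scratch (a sketch, assuming Python 2.7 and TensorFlow 1.2 as above) may be worth trying before falling back to a native install:
virtualenv --python=python2.7 env
source env/bin/activate
pip install tensorflow==1.2.0
# then rerun the `gcloud ml-engine local train` command above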

Cannot resubmit job to ml-engine because "A job with this id already exists"

I am trying to submit a job to gcloud ml-engine. For reference, the job uses this sample provided by Google.
The job was accepted the first time but failed with errors unrelated to this question; now I am trying to reissue the command after correcting my mistakes:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--runtime-version 1.0 \
--job-dir $GCS_JOB_DIR \
--module-name trainer.task \
--package-path trainer/ \
--region us-east1 \
-- \
--train-files $TRAIN_GCS_FILE \
--eval-files $EVAL_GCS_FILE \
--train-steps $TRAIN_STEPS
where $JOB_NAME = census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, etc. for every new job.
The following is the error I receive:
ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.
Is it part of the design that a job name cannot be reused, or am I missing something?
As Chunck just said, simply try setting JOB_NAME as:
JOB_NAME="census_$(date +%Y%m%d_%H%M%S)"
Not sure if this will help, but in Google's sample code for flowers, the error is avoided by appending the date and time to the job id, as shown on line 22, e.g.,
declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"

drums_rnn_generate: command not found

When I tried to use the Magenta "Drums RNN" model to generate my drum tracks, I got this message:
drums_rnn_generate: command not found
Here is my .sh file:
BUNDLE_PATH='/root/magenta/model/drum_kit_rnn.mag'
CONFIG='drum_kit'
drums_rnn_generate \
--config=${CONFIG} \
--bundle_file=${BUNDLE_PATH} \
--output_dir=/root/magenta/drums_rnn/generated \
--num_outputs=10 \
--num_steps=128 \
--primer_drums="[(36,)]"
I have installed the Magenta environment successfully and activated it. Strangely, I can use the melody_rnn_generate command from "melody_rnn" without any problem.
Has anyone run into the same problem? I want to know where the problem is and how to solve it. Thanks anyway.
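A quick diagnostic that might narrow it down (a sketch) is to check whether the console script was ever installed, as opposed to merely missing from PATH:
which melody_rnn_generate   # resolves, since this command works
which drums_rnn_generate    # prints nothing if the script was never installed
pip show magenta            # confirm which magenta package/version is active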