Error when doing the Tensorflow tutorial - tensorflow

I am following this tutorial to run training on Google Cloud ML Engine. I followed it step by step, but I get an error when I submit the job to the cloud. I ran this command:
sam@sam-VirtualBox:~/models/research$ gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` --job-dir=gs://tf_testing/train --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz --module-name object_detection.train --region us-central1 --config object_detection/samples/cloud/cloud.yml -- --train_dir=gs://tf_testing/train --pipeline_config_path=gs://tf_testing/data/faster_rcnn_resnet101_pets.config
and got this error:
ERROR: (gcloud.ml-engine.jobs.submit.training) FAILED_PRECONDITION: Field: package_uris Error: The provided GCS paths [gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/slim-0.1.tar.gz, gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/object_detection-0.1.tar.gz] cannot be read by service account service-499049193648@cloud-ml.google.com.iam.gserviceaccount.com.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: The provided GCS paths [gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/slim-0.1.tar.gz, gs://tf_testing/train/packages/8ec87a281aadb58d3d82462bbffafa9d7e521cc03025209704bc643eb9f3bc37/object_detection-0.1.tar.gz] cannot be read by service account service-499049193648@cloud-ml.google.com.iam.gserviceaccount.com.
    field: package_uris
I saw this post and this post and tried the solutions there, but they did not help. FYI, I did not change PATH_TO_BE_CONFIGURED when I ran this command. Could that be the reason?
sed -i "s|PATH_TO_BE_CONFIGURED|"gs://${YOUR_GCS_BUCKET}"/data|g" \
object_detection/samples/configs/faster_rcnn_resnet101_pets.config

You need to allow the service account to read from and write to your bucket:
gsutil acl ch -u $SVCACCT:WRITE gs://$BUCKET/
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/
Alternatively:
gcloud ml-engine init-project
This adds the service account as an editor on the project. Make sure to run it in the project that owns the bucket.
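The gsutil commands above assume that $SVCACCT and $BUCKET are already set. Here is a minimal sketch of one way to set them, using the project number and bucket name that appear in the question's error message. The `run` helper and the `DRY_RUN` switch are hypothetical additions: they only print each command until you set `DRY_RUN=0`.

```shell
# Values copied from the error message in the question; substitute your own.
PROJECT_NUMBER="499049193648"
BUCKET="tf_testing"

# Cloud ML service account name, following the pattern shown in the error.
SVCACCT="service-${PROJECT_NUMBER}@cloud-ml.google.com.iam.gserviceaccount.com"

# Hypothetical helper: print the command unless DRY_RUN=0, then execute it.
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

# Grant the service account WRITE access on the bucket itself.
run gsutil acl ch -u "${SVCACCT}:WRITE" "gs://${BUCKET}/"
# Grant it OWNER in the bucket's default object ACL (applies to new objects).
run gsutil defacl ch -u "${SVCACCT}:O" "gs://${BUCKET}/"
```

Review the printed commands first, then rerun with `DRY_RUN=0` to apply them.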

Unable to deploy a CNN model on GCP AI platform

I'm trying to deploy a model to GCP AI Platform but I'm not getting anywhere. I know similar questions have been asked before, but I can't work out what I'm doing wrong. Any help would be appreciated. I'm using a Jupyter notebook for development.
MODEL_NAME='CLASSIFIER'
VERSION_NAME='v1'
RUNTIME_VERSION='1.13'
REGION='europe-west1'
!gcloud ai-platform models create {MODEL_NAME} --regions {REGION}
The model is created on the global endpoint.
!gcloud beta ai-platform versions create {VERSION_NAME} \
--model {MODEL_NAME} \
--origin gs://{BUCKET}/{MODEL_DIR} \
--python-version 3.7 \
--runtime-version {RUNTIME_VERSION} \
--package-uris gs://{BUCKET}/{PACKAGES_DIR}/sentiment_classifier-0.1.tar.gz \
--prediction-class=model_prediction.CustomPrediction \
--machine-type mls1-c4-m4
But when I try to create the version, it seems to use the regional endpoint and fails, or so I think:
Using endpoint [https://us-central1-ml.googleapis.com/]
ERROR: (gcloud.beta.ai-platform.versions.create) INVALID_ARGUMENT: Machine type is not available on this endpoint.
I was able to run the command without issues. To do so, I used an existing sample to replicate the commands.
Note that both the model and the version point to https://ml.googleapis.com, because this endpoint accepts the beta commands used here. You can see the details at this link, under --region. You can choose a different region, but keep in mind that it must support the beta commands and that the model and the version must point to the same endpoint.
model
MODEL_NAME = 'CensusPredictor'
VERSION_NAME = 'v1'
REGION = "global"
! gcloud ai-platform models create $MODEL_NAME \
--regions $REGION
output
WARNING: To specify a region where the model will deployed on the global endpoint, please use `--regions` and do not specify `--region`. Using [us-central1] by default on https://ml.googleapis.com. Please note that your model will be inaccessible from https://us-central1-ml.googelapis.com
Learn more about regional endpoints and see a list of available regions: https://cloud.google.com/ai-platform/prediction/docs/regional-endpoints
Using endpoint [https://ml.googleapis.com/]
Created ai platform model [projects/<my-project-id>/models/CensusPredictor].
beta ai-platform versions create
! gcloud beta ai-platform versions create $VERSION_NAME --model $MODEL_NAME \
--origin gs://$BUCKET_NAME/custom_pipeline_tutorial/model/ \
--runtime-version 1.13 \
--python-version 3.5 \
--framework SCIKIT_LEARN \
--package-uris gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz \
--region $REGION \
--machine-type mls1-c1-m2
output
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......done.
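To check that the model and the version use the same endpoint, compare the `Using endpoint [...]` line that gcloud prints for each command. The sketch below works out which endpoint to expect for a given region value. It is based only on the two URL forms that appear in the outputs in this thread, and `endpoint_for_region` is a hypothetical helper, not part of gcloud.

```shell
# Hypothetical helper: map a region value to the endpoint URL gcloud reports.
endpoint_for_region() {
  case "$1" in
    "")     echo "usage: endpoint_for_region REGION" >&2; return 1 ;;
    global) echo "https://ml.googleapis.com/" ;;
    *)      echo "https://$1-ml.googleapis.com/" ;;
  esac
}

endpoint_for_region global        # endpoint used in the working example
endpoint_for_region us-central1   # endpoint reported in the failing command
```

If the two commands report different endpoints, pass the same region value to both.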

How to use GPU on AI Platform Pipelines

How do I use a GPU on AI Platform Pipelines? My pipeline calls set_gpu_limit(1) on one of the ops, but the step stays in the Pending state with this message: Unschedulable: 0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
Got it a few minutes later. I followed the normal Kubeflow-on-GPU instructions:
export GPU_POOL_NAME=gpu-pool
export CLUSTER_NAME=cluster-1
gcloud container node-pools create ${GPU_POOL_NAME} \
--accelerator type=nvidia-tesla-k80,count=1 \
--zone us-central1-a --cluster ${CLUSTER_NAME} \
--num-nodes=0 --machine-type=n1-standard-4 --min-nodes=0 --max-nodes=1 --enable-autoscaling
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

In Google Cloud's "built-in image object detection algorithm" documentation, what is config.yaml?

In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
...
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on line 2 (the original had a JOB_DIR starting with gs://gs://, which threw an error):
PROJECT_ID="myapp"
# Original:
#BUCKET_NAME="gs://mybucket/"
# Correction:
BUCKET_NAME="mybucket"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/val*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"
DATASET_NAME="coco"
ALGORITHM="object_detection"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"
# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
--aug_scale_max=1.2
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error:
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
https://cloud.google.com/ml/reference/rest/v1/projects.jobs
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command line argument:
--config=config.yaml
With this one:
--master-image-uri $IMAGE_URI
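If you would rather keep `--config=config.yaml`, the gcloud help text quoted above says the file holds a Job resource. The following is a sketch, not a verified fix. It assumes that `trainingInput.masterConfig.imageUri` is the Job field corresponding to `--master-image-uri`, and it has not been run against the service. An empty file parses to nothing at all, which would be consistent with the 'NoneType' crash.

```shell
# Image URI taken from the documentation snippet in the question.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"

# Write a minimal config.yaml in the directory where gcloud is run.
# Assumption: masterConfig.imageUri mirrors the --master-image-uri flag.
cat > config.yaml <<EOF
trainingInput:
  masterConfig:
    imageUri: ${IMAGE_URI}
EOF

# Show the result so it can be checked before submitting the job.
cat config.yaml
```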

Error when submitting training job to gcloud

I am new to training on Google Cloud.
When I am running the training job, I get the following error:
(gcloud.ml-engine.jobs.submit.training) Could not copy [research/dist/object_detection-0.1.tar.gz] to [training/packages/c5292b23e57f357dc2d63baab473c04337dbadd2deeb10965e743cd8422b964f/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I am using this to run the training job
gcloud ml-engine jobs submit training job1 \
--job-dir=gs://${ml-project-neu}/training \
--packages research/dist/object_detection-0.1.tar.gz,research/slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--config cloud.yml \
--runtime-version=1.4
-- \
--train_dir=gs://${ml-project-neu}/training \
--pipeline_config_path=gs://${ml-project-neu}/data/faster_rcnn_inception_v2_pets.config
Make sure ${ml-project-neu} is valid (it may be the empty string in your case), and make sure gs://${ml-project-neu} exists. Also make sure the credentials you are using with gcloud have access to your GCS bucket (consider running gcloud auth login).
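One thing worth checking is how the shell reads `${ml-project-neu}`. In POSIX shells, `${name-word}` means "the value of `name`, or `word` if `name` is unset", so this is not a variable called `ml-project-neu`. The sketch below shows the behaviour. The variable `ML_PROJECT_NEU` and the bucket name `my-bucket` are hypothetical.

```shell
# With `ml` unset, the expansion falls back to the text after the dash.
unset ml
echo "gs://${ml-project-neu}/training"    # gs://project-neu/training

# With `ml` set but empty, the expansion is the empty string.
ml=""
echo "gs://${ml-project-neu}/training"    # gs:///training

# Variable names may contain only letters, digits, and underscores.
ML_PROJECT_NEU="my-bucket"   # hypothetical bucket name
# ${VAR:?message} aborts with the message if VAR is unset or empty.
echo "gs://${ML_PROJECT_NEU:?set ML_PROJECT_NEU to your bucket name}/training"
```

Echoing the final gs:// paths before submitting shows whether they point at a bucket that exists.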

Cannot resubmit job to ml-engine because "A job with this id already exists"

I am trying to submit a job to gcloud ml-engine. For reference, the job uses this sample provided by Google.
It went through the first time, but with errors unrelated to this question, and now I am trying to reissue the command after correcting those errors:
gcloud ml-engine jobs submit training $JOB_NAME \
--stream-logs \
--runtime-version 1.0 \
--job-dir $GCS_JOB_DIR \
--module-name trainer.task \
--package-path trainer/ \
--region us-east1 \
-- \
--train-files $TRAIN_GCS_FILE \
--eval-files $EVAL_GCS_FILE \
--train-steps $TRAIN_STEPS
where $JOB_NAME is census. Unfortunately, it seems that I cannot resubmit the job unless I change $JOB_NAME to something like census2, then census3, and so on for every new job.
The following is the error I receive:
ERROR: (gcloud.ml-engine.jobs.submit.training) Project [my-project-name]
is the subject of a conflict: Field: job.job_id Error: A job with this
id already exists.
Is it by design that a job cannot be resubmitted under the same name, or am I missing something?
Like Chunck just said, simply try setting JOB_NAME as:
JOB_NAME="census_$(date +%Y%m%d_%H%M%S)"
Not sure if this will help but in Google's sample code for flowers, the error is avoided by appending the date and time to the job id as shown on line 22, e.g.,
declare -r JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
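Both suggestions use the same idea, which can be wrapped in a small function so that every submission gets a fresh id. This is a minimal sketch, and `make_job_name` is a hypothetical helper. Two submissions within the same second would still collide.

```shell
# Hypothetical helper: append a timestamp to a base name.
make_job_name() {
  echo "${1}_$(date +%Y%m%d_%H%M%S)"
}

# Build a new job name right before each submission.
JOB_NAME="$(make_job_name census)"
echo "$JOB_NAME"
```

The resulting `$JOB_NAME` can be passed to `gcloud ml-engine jobs submit training` exactly as in the question.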