Unable to deploy a CNN model on GCP AI platform - tensorflow

I'm trying to deploy a model to GCP AI Platform but am not getting anywhere. I know similar questions have been asked before, but I can't work out what I'm doing wrong, so any help would be appreciated. I'm using a Jupyter notebook for development.
MODEL_NAME='CLASSIFIER'
VERSION_NAME='v1'
RUNTIME_VERSION='1.13'
REGION='europe-west1'
!gcloud ai-platform models create {MODEL_NAME} --regions {REGION}
The model is created on the global endpoint:
!gcloud beta ai-platform versions create {VERSION_NAME} \
--model {MODEL_NAME} \
--origin gs://{BUCKET}/{MODEL_DIR} \
--python-version 3.7 \
--runtime-version {RUNTIME_VERSION} \
--package-uris gs://{BUCKET}/{PACKAGES_DIR}/sentiment_classifier-0.1.tar.gz \
--prediction-class=model_prediction.CustomPrediction \
--machine-type mls1-c4-m4
But when I try to create the version, gcloud tries to create it on a regional endpoint and fails, or so I think:
Using endpoint [https://us-central1-ml.googleapis.com/]
ERROR: (gcloud.beta.ai-platform.versions.create) INVALID_ARGUMENT: Machine type is not available on this endpoint.

I was able to run the commands without issues. To do so, I used an existing sample to replicate them.
Note that both the model and the version point to https://ml.googleapis.com, because this endpoint accepts the beta command used in this case. You can see the details on this link under --region. You can choose a different region, but keep in mind that it must support the beta command and that both the model and the version must point to the same endpoint.
model
MODEL_NAME = 'CensusPredictor'
VERSION_NAME = 'v1'
REGION = "global"
! gcloud ai-platform models create $MODEL_NAME \
--regions $REGION
output
WARNING: To specify a region where the model will deployed on the global endpoint, please use `--regions` and do not specify `--region`. Using [us-central1] by default on https://ml.googleapis.com. Please note that your model will be inaccessible from https://us-central1-ml.googelapis.com
Learn more about regional endpoints and see a list of available regions: https://cloud.google.com/ai-platform/prediction/docs/regional-endpoints
Using endpoint [https://ml.googleapis.com/]
Created ai platform model [projects/<my-project-id>/models/CensusPredictor].
beta ai-platform versions create
! gcloud beta ai-platform versions create $VERSION_NAME --model $MODEL_NAME \
--origin gs://$BUCKET_NAME/custom_pipeline_tutorial/model/ \
--runtime-version 1.13 \
--python-version 3.5 \
--framework SCIKIT_LEARN \
--package-uris gs://$BUCKET_NAME/custom_pipeline_tutorial/code/census_package-1.0.tar.gz \
--region $REGION \
--machine-type mls1-c1-m2
output
Using endpoint [https://ml.googleapis.com/]
Creating version (this might take a few minutes)......done.
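To make the "same endpoint" point concrete, here is a minimal sketch in Python. It assumes the URL pattern visible in the gcloud output quoted in this thread ("global" maps to https://ml.googleapis.com/, any other --region value maps to https://<region>-ml.googleapis.com/). The function names are made up for illustration and are not part of gcloud.

```python
GLOBAL_ENDPOINT = "https://ml.googleapis.com/"


def endpoint_for(region):
    """Return the service URL that a given --region value would select.

    Assumption: the pattern is inferred from the "Using endpoint [...]"
    lines in the gcloud output, not taken from an official API.
    """
    if region == "global":
        return GLOBAL_ENDPOINT
    return "https://{}-ml.googleapis.com/".format(region)


def same_endpoint(model_region, version_region):
    """True when the model and the version would use the same endpoint."""
    return endpoint_for(model_region) == endpoint_for(version_region)


if __name__ == "__main__":
    # The failing case from the question: model on the global endpoint,
    # version request sent to the us-central1 regional endpoint.
    print(same_endpoint("global", "us-central1"))
    # The working case from this answer: both on the global endpoint.
    print(same_endpoint("global", "global"))
```

In the question, the version request went to https://us-central1-ml.googleapis.com/ while the model lived on the global endpoint, which this check would flag before calling gcloud.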


In Google Cloud's "built-in image object detection algorithm" documentation, what is config.yaml?

In the official GCP documentation for the built-in image object detection classifier, Step 2 under "Submit a training job" says:
Submit the job:
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
...
This is the first reference to "config.yaml" on this page.
Has anyone been able to implement this example?
Here is the code from the above documentation page in full, including a correction on line 2 (the original had a JOB_DIR starting with gs://gs://, which threw an error):
PROJECT_ID="myapp"
# Original:
#BUCKET_NAME="gs://mybucket/"
# Correction:
BUCKET_NAME="mybucket"
REGION="us-central1"
gcloud config set project $PROJECT_ID
gcloud config set compute/region $REGION
# Set paths to the training and validation data.
TRAINING_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/train*"
VALIDATION_DATA_PATH="gs://cloud-samples-data/ai-platform/built-in/image/coco/val*"
# Specify the Docker container for your built-in algorithm selection.
IMAGE_URI="gcr.io/cloud-ml-algos/image_object_detection:latest"
DATASET_NAME="coco"
ALGORITHM="object_detection"
MODEL_NAME="${DATASET_NAME}_${ALGORITHM}_model"
# Give a unique name to your training job.
DATE="$(date '+%Y%m%d_%H%M%S')"
JOB_ID="${MODEL_NAME}_${DATE}"
# Make sure you have access to this Cloud Storage bucket.
JOB_DIR="gs://${BUCKET_NAME}/algorithms_training/${MODEL_NAME}/${DATE}"
gcloud ai-platform jobs submit training $JOB_ID \
--region=$REGION \
--config=config.yaml \
--job-dir=$JOB_DIR \
-- \
--training_data_path=$TRAINING_DATA_PATH \
--validation_data_path=$VALIDATION_DATA_PATH \
--train_batch_size=64 \
--num_eval_images=500 \
--train_steps_per_eval=2000 \
--max_steps=15000 \
--num_classes=90 \
--warmup_steps=500 \
--initial_learning_rate=0.08 \
--fpn_type="nasfpn" \
--aug_scale_min=0.8 \
--aug_scale_max=1.2
gcloud ai-platform jobs describe $JOB_ID
gcloud ai-platform jobs stream-logs $JOB_ID
Running the above results in the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) Failed to load YAML from [config.yaml]: Unable to read file [config.yaml]: [Errno 2] No such file or directory: u'config.yaml'
Creating an empty config.yaml produces this error:
ERROR: gcloud crashed (AttributeError): 'NoneType' object has no attribute 'get'
From the gcloud documentation:
Path to the job configuration file. This file should be a YAML
document (JSON also accepted) containing a Job resource as defined in
the API (all fields are optional):
https://cloud.google.com/ml/reference/rest/v1/projects.jobs
I submitted feedback on this page a couple of weeks ago, but haven't heard back and it is still broken.
What content is required in config.yaml to make this work?
Any and all ideas/suggestions are welcome!
I managed to get it working by replacing this command line argument:
--config=config.yaml
With this one:
--master-image-uri $IMAGE_URI
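For anyone who would rather keep --config, the gcloud help text quoted in the question says the file holds a Job resource. Below is a minimal sketch, assuming that the trainingInput.masterConfig.imageUri field of that resource plays the same role as --master-image-uri. I have not verified this against the built-in algorithm, and the helper names are only illustrative. It emits JSON, which the help text says is accepted. An empty file presumably parses to nothing at all, which would be consistent with the 'NoneType' crash above.

```python
import json


def build_job_config(image_uri):
    """Return a Job-resource-shaped dict with only the master image set.

    Assumption: masterConfig.imageUri is the config-file equivalent of
    the --master-image-uri flag.
    """
    return {"trainingInput": {"masterConfig": {"imageUri": image_uri}}}


def render_job_config(image_uri):
    """Serialize the config as JSON text."""
    return json.dumps(build_job_config(image_uri), indent=2)


if __name__ == "__main__":
    # Same image as IMAGE_URI in the question; redirect the output to
    # config.yaml to use it with --config.
    print(render_job_config("gcr.io/cloud-ml-algos/image_object_detection:latest"))
```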

Using TPU on Cloud ML Engine

I am trying to use TPU on Cloud ML Engine but I am at a loss as to how I should provide the tpu argument which TPUClusterResolver expects.
This is the environment I am using:
--python-version 3.5 \
--runtime-version 1.12 \
--region us-central1 \
--scale-tier BASIC_TPU
The job crashes with:
ValueError: Please provide a TPU Name to connect to.
As a separate issue, ML Engine seems to add --master grpc://10.129.152.2:8470 to my job on its own, which also crashes the job. As a workaround, I just added an unused master flag to my code.
This was a known issue for runtime versions 1.11 and 1.12, and it has been fixed. The service no longer appends --master to your training application. You should continue using TPUClusterResolver.
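If you are still on an affected runtime, the workaround from the question (accepting a master flag and ignoring it) can be sketched with argparse as below. The --train_steps flag is a made-up placeholder for your own training flags; only --master comes from the question.

```python
import argparse


def parse_args(argv=None):
    """Parse training flags while tolerating an injected --master flag."""
    parser = argparse.ArgumentParser()
    # Hypothetical training flag, for illustration only.
    parser.add_argument("--train_steps", type=int, default=1000)
    # Accepted but never read: it only exists so that an appended
    # "--master grpc://..." does not abort argument parsing.
    parser.add_argument("--master", default=None, help="ignored")
    # parse_known_args also returns any other unrecognized flags
    # instead of exiting with an error.
    args, unknown = parser.parse_known_args(argv)
    return args, unknown


if __name__ == "__main__":
    args, unknown = parse_args(
        ["--train_steps", "10", "--master", "grpc://10.129.152.2:8470"]
    )
    print(args.train_steps, unknown)
```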

Google Cloud ML: Use Nightly TF Import Error No Module tensorflow

I want to train the NMT model from Google on Google Cloud ML.
NMT Model
Now I put all input data in a bucket and downloaded the git repository.
The model needs the nightly version of TensorFlow, so I defined it in setup.py. When I use the CPU version, tf-nightly==1.5.0-dev20171115, and run the following command to train locally through gcloud, it works.
Train local on google:
gcloud ml-engine local train --package-path nmt/ \
--module-name nmt.nmt \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
When I use the GPU version with the following command, I get this error message a few minutes after submitting the job.
Train on cloud
gcloud ml-engine jobs submit training $JOB_NAME \
--runtime-version 1.2 \
--job-dir $JOB_DIR \
--package-path nmt/ \
--module-name nmt.nmt \
--scale-tier BASIC_GPU \
--region $REGION \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
Error:
import tensorflow as tf ImportError: No module named tensorflow
setup.py:
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['tf-nightly-gpu==1.5.0-dev20171115']
setup(
    name="nmt",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    version='0.1.2'
)
Thank you all in advance
Markus
Update:
I have found a note in the GCP docs:
Note: Training with TensorFlow versions 1.3+ is limited to CPUs only. See the Cloud ML Engine release notes for updates.
So it seems this doesn't work currently, and I think I have to go with Compute Engine.
Or is there any hack to get it working?
Anyway, thank you for your help.
TensorFlow 1.5 might need a newer version of CUDA (i.e., CUDA 9), but the version Cloud ML Engine has installed is CUDA 8. Can you please try TensorFlow 1.4 instead, which works with CUDA 8? Please tell us here whether 1.4 works for you, or send us an email via cloudml-feedback#google.com
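While trying different versions, it may help to log which packages are actually importable on the worker before the real import fails. This is only a diagnostic sketch in Python 3 (the traceback in the question looks like Python 2, so treat it as an assumption that it fits your runtime). missing_modules is a made-up helper and expects top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found.

    find_spec returns None for a top-level module that is not installed,
    without actually importing anything.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # An empty list means every module in the list can be located.
    print(missing_modules(["tensorflow"]))
```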

Tensorflow couldn't open CUDA library libcupti.so.8.0 during training on google cloud

I'm trying to train a model using TensorFlow on Google Cloud ML Engine. It seems that TensorFlow can't get to the libcupti files on the cloud compute machine because LD_LIBRARY_PATH does not point to the correct directory, as implied by the log entry below:
lineno: 126
message: "Couldn't open CUDA library libcupti.so.8.0.
LD_LIBRARY_PATH: /usr/local/cuda/lib64"
levelname: "INFO"
pathname: "tensorflow/stream_executor/dso_loader.cc"
created: 1491143889.84344
As far as I know, the libcupti files are all in /usr/local/cuda/extras/CUPTI/lib64, so I would need to append this to the LD_LIBRARY_PATH variable, but how would I do that when submitting a job via a gcloud ml-engine jobs submit training $JOB_NAME command? Or maybe there's an easier solution?
I tried using a GPU with TensorFlow on Google Cloud and it works for me. In my code I didn't do any GPU-specific setup (nor set anything in LD_LIBRARY_PATH).
I think you can try with simple, standard TensorFlow code. When you submit the job, attach a config and the job should automatically use the GPU for the computation.
Try adding a file such as cloudml-gpu.yaml to your module with the following content:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.0"
Then add an option --config=trainer/cloudml-gpu.yaml (assuming your training module is in a folder called trainer). For example:
export BUCKET_NAME=tf-learn-simple-sentiment
export JOB_NAME="example_5_train_$(date +%Y%m%d_%H%M%S)"
export JOB_DIR=gs://$BUCKET_NAME/$JOB_NAME
export REGION=europe-west1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--runtime-version 1.0 \
--module-name trainer.example5-keras \
--package-path ./trainer \
--region $REGION \
--config=trainer/cloudml-gpu.yaml \
-- \
--train-file gs://tf-learn-simple-sentiment/sentiment_set.pickle
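If you want to confirm the hypothesis from the question first, a small diagnostic can report which directories on LD_LIBRARY_PATH actually contain the library. This is a sketch only: dirs_containing is a made-up helper, and it just checks for a file with that exact name in each listed directory. It does not change how TensorFlow loads the library.

```python
import os


def dirs_containing(filename, search_path):
    """Return the directories in a colon-separated path that hold filename."""
    found = []
    for directory in search_path.split(":"):
        # Skip empty entries such as those produced by a trailing colon.
        if not directory:
            continue
        if os.path.isfile(os.path.join(directory, filename)):
            found.append(directory)
    return found


if __name__ == "__main__":
    # Log this at the start of the training module. An empty list would
    # match the "Couldn't open CUDA library" message in the question.
    path = os.environ.get("LD_LIBRARY_PATH", "")
    print(dirs_containing("libcupti.so.8.0", path))
```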

how to install tensorflow on google cloud platform

When I use the command pip install tensorflow, the download reaches only 99% and then terminates. How can I install TensorFlow using Google Cloud Shell?
Instead of installing it yourself, you can use the machine learning API and use TensorFlow for training or inference. Just follow these guidelines: https://cloud.google.com/ml/docs/quickstarts/training
You can submit a TensorFlow job like this:
gcloud beta ml jobs submit training ${JOB_NAME} \
--package-path=trainer \
--module-name=trainer.task \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--train_dir="${TRAIN_PATH}/train"