TensorFlow couldn't open CUDA library libcupti.so.8.0 during training on Google Cloud

I'm trying to train a model using TensorFlow on Google Cloud ML Engine. It seems that TensorFlow can't get to the libcupti files on the cloud compute machine because LD_LIBRARY_PATH does not point to the correct directory, as implied by the log entry below:
lineno: 126
message: "Couldn't open CUDA library libcupti.so.8.0.
LD_LIBRARY_PATH: /usr/local/cuda/lib64"
levelname: "INFO"
pathname: "tensorflow/stream_executor/dso_loader.cc"
created: 1491143889.84344
As far as I know, the libcupti files are all in /usr/local/cuda/extras/CUPTI/lib64, so I would need to append this to the LD_LIBRARY_PATH variable, but how would I do that when submitting a job via a gcloud ml-engine jobs submit training $JOB_NAME command? Or maybe there's an easier solution?

I tried using a GPU with TensorFlow on Google Cloud and it works for me. In my code I didn't do any GPU-specific setup (nor set anything with LD_LIBRARY_PATH).
I think you can start with simple, standard TensorFlow code; when you submit the job, attach a config and the job should automatically use the GPU for the computation.
Try adding a file such as cloudml-gpu.yaml to your module with the following content:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs.
  masterType: standard_gpu
  runtimeVersion: "1.0"
Then add the option --config=trainer/cloudml-gpu.yaml (assuming your training module is in a folder called trainer). For example:
export BUCKET_NAME=tf-learn-simple-sentiment
export JOB_NAME="example_5_train_$(date +%Y%m%d_%H%M%S)"
export JOB_DIR=gs://$BUCKET_NAME/$JOB_NAME
export REGION=europe-west1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--runtime-version 1.0 \
--module-name trainer.example5-keras \
--package-path ./trainer \
--region $REGION \
--config=trainer/cloudml-gpu.yaml \
-- \
--train-file gs://tf-learn-simple-sentiment/sentiment_set.pickle
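If you specifically want libcupti on the loader path (it is mainly used for CUPTI-based profiling/tracing, so the warning above is usually harmless for plain training), one possible workaround is sketched below; it assumes you can edit the entry point of your trainer package. LD_LIBRARY_PATH is only read by the dynamic loader at process start, so the sketch extends it and re-execs the interpreter before TensorFlow is imported:
# Workaround sketch: put the CUPTI directory on LD_LIBRARY_PATH and re-exec
# the interpreter so the dynamic loader picks it up, then import TensorFlow.
import os
import sys

CUPTI_DIR = '/usr/local/cuda/extras/CUPTI/lib64'

if CUPTI_DIR not in os.environ.get('LD_LIBRARY_PATH', ''):
    os.environ['LD_LIBRARY_PATH'] = ':'.join(
        filter(None, [os.environ.get('LD_LIBRARY_PATH'), CUPTI_DIR]))
    os.execv(sys.executable, [sys.executable] + sys.argv)

import tensorflow as tf  # imported only after the (possible) re-exec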

Related

Using TPU on Cloud ML Engine

I am trying to use TPU on Cloud ML Engine but I am at a loss as to how I should provide the tpu argument which TPUClusterResolver expects.
This is the environment I am using:
--python-version 3.5 \
--runtime-version 1.12 \
--region us-central1 \
--scale-tier BASIC_TPU
The job crashes with:
ValueError: Please provide a TPU Name to connect to.
As a separate issue, ML Engine seems to be adding --master grpc://10.129.152.2:8470 to my job on its own, which also crashes the job. As a workaround I just added an unused --master flag to my code.
This was a known issue for runtimes 1.11 and 1.12, and it has been fixed. The service no longer appends --master to your training application. You should continue using TPUClusterResolver.
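For reference, here is a minimal sketch of how the tpu argument can be wired through, assuming the TPU name or gRPC address is passed in via a --tpu flag (the flag name is an assumption, not something the service defines for you); parse_known_args also quietly tolerates any extra service-injected flags such as --master:
# Sketch only: hand a --tpu flag to TPUClusterResolver; unknown flags
# (e.g. an injected --master) are ignored by parse_known_args.
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument('--tpu', default=None,
                    help='TPU name or grpc://host:port address')
args, _unknown = parser.parse_known_args()

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu=args.tpu)
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))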

Can anyone help me identify the "bug" in my Google Cloud ML training job?

I was following the link below to replicate the process with new data and a new model:
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md
When I reach the last step, I start the training job with the command below:
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://marksbucket0000/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/markli/Desktop/chase_ad_object/project_2/cluster_config/cloud.yml \
-- \
--train_dir=gs://marksbucket0000/train \
--pipeline_config_path=gs://marksbucket0000/data/ssd_mobilenet_v1_coco.config
It seems the job is kicking off successfully:
Job [xxx_object_detection_xxxxxxx] submitted successfully.
Your job is still active. You may view the status of your job with the command
$ gcloud ml-engine jobs describe xxx_object_detection_xxxxxxx
or continue streaming the logs with the command
However, it stops due to the following errors in the log:
Since I am extremely new to Google Cloud ML and the TensorFlow Object Detection API, I couldn't find a clue in the log regarding which step I got wrong.
The YAML cluster configuration file I was using is:
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
I would really appreciate it if anyone could at least point me in a direction for debugging. Thanks so much in advance!
---------------- Update on the question --------------
I have actually got it working by changing the setup.py as below:
"""Setup script for object_detection."""
from setuptools import find_packages
from setuptools import setup
# REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
REQUIRED_PACKAGES = ['Tensorflow>=1.4.0','Pillow>=1.0','Matplotlib>=2.1','Cython>=0.28.1','Jupyter']
setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)
Though I ran into some "no module found" issues when running the training job, there are plenty of online discussions that quickly identify the solution, so I am not replicating them here.
However, I did get stuck on one issue when running the evaluation job - "cannot import pycocotools" - for which I found the solution here: https://github.com/tensorflow/models/issues/3470
Now both my training and evaluation jobs are up and running. However, it seems strange that I can't see any statistics (e.g. the loss plot in orange) show up for the evaluation job on TensorBoard's scalars display (yet I do see the eval job checkbox show up as a view option there):
I have also checked the log of the eval job and found that the node seems to be constantly skipping images. Is this the cause of the issue? Maybe some issue with the evaluation dataset?
Log info in eval job:
Parallel interleave functionality is available only in TensorFlow 1.5+. Try changing the line in your YAML to:
runtimeVersion: "1.8"

Google Cloud ML: using nightly TF, ImportError: No module named tensorflow

I want to train the NMT model from Google on Google Cloud ML.
NMT Model
Now I have put all the input data in a bucket and downloaded the git repository.
The model needs the nightly version of TensorFlow, so I defined it in setup.py. When I use the CPU version, tf-nightly==1.5.0-dev20171115, and run the following command to run it locally via gcloud, it works.
Train locally on Google:
gcloud ml-engine local train --package-path nmt/ \
--module-name nmt.nmt \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
Now when I use the GPU version with the following command, I get this error message a few minutes after submitting the job.
Train on the cloud:
gcloud ml-engine jobs submit training $JOB_NAME \
--runtime-version 1.2 \
--job-dir $JOB_DIR \
--package-path nmt/ \
--module-name nmt.nmt \
--scale-tier BASIC_GPU \
--region $REGION \
-- --src=en --tgt=de \
--hparams_path=$HPARAMAS_PATH \
--out_dir=$OUTPUT_DIR \
--vocab_prefix=$VOCAB_PREFIX \
--train_prefix=$TRAIN_PREFIX \
--dev_prefix=$DEV_PREFIX \
--test_prefix=$TEST_PREFIX
Error:
import tensorflow as tf
ImportError: No module named tensorflow
setup.py:
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['tf-nightly-gpu==1.5.0-dev20171115']

setup(
    name="nmt",
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    version='0.1.2'
)
Thank you all in advance
Markus
Update:
I have found a note in the GCP docs:
Note: Training with TensorFlow versions 1.3+ is limited to CPUs only. See the Cloud ML Engine release notes for updates.
So it seems this doesn't work at the moment; I think I have to go with Compute Engine instead.
Or is there any hack to get it working?
In any case, thank you for your help.
TensorFlow 1.5 may need a newer version of CUDA (i.e., CUDA 9), but the version Cloud ML Engine has installed is CUDA 8. Can you please try TensorFlow 1.4 instead, which works on CUDA 8? Please tell us here whether 1.4 works for you, or send us an email via cloudml-feedback@google.com
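If you do drop back to 1.4, a minimal setup.py sketch (keeping the packaging from the question, which is an assumption on my part) would pin the GPU build that matches the CUDA 8 toolchain on the workers:
# Sketch: pin the GPU build of TensorFlow 1.4, which runs on CUDA 8.
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['tensorflow-gpu==1.4.0']

setup(
    name='nmt',
    version='0.1.2',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
)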

How to install TensorFlow on Google Cloud Platform

When I use the command pip install tensorflow, the download only reaches 99% and terminates at that point. How can I install TensorFlow using Google Cloud Shell?
Instead of installing it yourself, you can use the machine learning API and use TensorFlow for training or inference. Just follow these guidelines: https://cloud.google.com/ml/docs/quickstarts/training
You can submit a TensorFlow job like this:
gcloud beta ml jobs submit training ${JOB_NAME} \
--package-path=trainer \
--module-name=trainer.task \
--staging-bucket="${TRAIN_BUCKET}" \
--region=us-central1 \
-- \
--train_dir="${TRAIN_PATH}/train"

Keras on Google Cloud ML does not seem to use the GPU. Is it possible to make it work?

I tried running Keras with the TensorFlow backend on Cloud ML (Google Cloud Platform). I find that Keras does not seem to use the GPU: one epoch takes 190 seconds on my CPU, which is equal to what I see in the logs dumped from Cloud ML. Is there a way to identify whether code is running on the GPU or the CPU in Keras? Has anybody tried Keras on Cloud ML with the TensorFlow backend?
Update: As of March 2017, GPUs are publicly available. See Fuyang Liu's answer.
GPUs are not currently available on CloudML. However, they will be in the upcoming months.
Yes, it is supported now.
Basically you need to add a file such as cloudml-gpu.yaml to your module with the following content:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs.
  masterType: standard_gpu
  runtimeVersion: "1.0"
Then add the option --config=trainer/cloudml-gpu.yaml (assuming your training module is in a folder called trainer). For example:
export BUCKET_NAME=tf-learn-simple-sentiment
export JOB_NAME="example_5_train_$(date +%Y%m%d_%H%M%S)"
export JOB_DIR=gs://$BUCKET_NAME/$JOB_NAME
export REGION=europe-west1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir gs://$BUCKET_NAME/$JOB_NAME \
--runtime-version 1.0 \
--module-name trainer.example5-keras \
--package-path ./trainer \
--region $REGION \
--config=trainer/cloudml-gpu.yaml \
-- \
--train-file gs://tf-learn-simple-sentiment/sentiment_set.pickle
You may also want to check out this URL for the regions where GPUs are available and other related info.
import keras.backend.tensorflow_backend as K
K.set_session(K.tf.Session(config=K.tf.ConfigProto(log_device_placement=True)))
should make Keras print the device placement of each tensor to stdout or stderr.
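As a complementary check (not part of the answer above), you can also ask TensorFlow directly which devices it registered before training starts:
# List the devices TensorFlow sees; an empty list means no GPU is visible
# and the job is running on CPU only.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
print([d.name for d in devices if d.device_type == 'GPU'])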