Google Cloud ML Engine can't locate local TFRecords - tensorflow

I am trying to use Google Cloud ML Engine to optimize hyperparameters for my variational autoencoder model, but the job fails because the .tfrecord files I specify for my input are not found. In my model code, I pass train.tfrecords to my input tensor as in the canonical cifar10 example and specify the location of train.tfrecords with the full path.
Relevant information:
JOB_DIR points to the trainer directory
My setup.py file is below:
from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = ['tensorflow==1.3.0', 'opencv-python']

setup(
    name='trainer',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='My trainer application package.'
)

When the job executes, it will not be able to read data from your local machine. The easiest way to make TFRecord files available to your job is to copy them to GCS, pass the location of the GCS files to your program as flags, and use those flags to configure your readers and writers.
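A minimal sketch of that pattern (the flag name and bucket path are hypothetical, and it assumes a TF version with the core tf.data API, i.e. 1.4+; on 1.3 the same gs:// path works with the queue-based readers from the cifar10 example):

import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
# Hypothetical flag; point it at the GCS copy of train.tfrecords.
parser.add_argument('--train-files',
                    default='gs://your-bucket/data/train.tfrecords')
args, _ = parser.parse_known_args()

# TensorFlow's file APIs accept gs:// paths natively, so the reader is
# configured exactly as it would be for a local path.
dataset = tf.data.TFRecordDataset(args.train_files)

Copy the files up first (e.g. gsutil cp train.tfrecords gs://your-bucket/data/) and pass the resulting path via --train-files when submitting the job.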

Related

Is there a way to read the images for my convolutional neural network directly from my desktop?

I'm training a Convolutional Neural Network using Google Colaboratory. I have my data (images) stored in Google Drive and I'm able to use it correctly. However, sometimes the process of reading the images is too slow and fails (other times it is faster and I have no problem reading the images). In order to read the images from Google Drive I use:
from google.colab import drive
drive.mount('/content/drive')
!unzip -u "/content/drive/My Drive/the folder/files.zip"

import glob
from os import path

IMAGE_PATH = '/content/drive/My Drive/the folder'
file_paths = glob.glob(path.join(IMAGE_PATH, '*.png'))
Sometimes this works; other times it fails or is just too slow :).
Either way, I would like to read my data from a folder on my desktop without using Google Drive, but I'm not able to do this.
I'm trying the following:
IMAGE_PATH = 'C:/Users/path/to/my/folder'
file_paths = glob.glob(path.join(IMAGE_PATH, '*.png'))
But I get an error saying that the directory/file does not exist.
Google Colab cannot directly access a dataset on your local machine, because Colab runs on a separate virtual machine in the cloud. You need to upload the dataset to Google Drive and then load it into Colab's runtime for model building.
For that you need to follow the steps given below:
Create a zip file of your large dataset and upload it to your Google Drive.
Now open Google Colab with the same Google ID, mount Google Drive using the code below, and authorize access to the drive:
from google.colab import drive
drive.mount('/content/drive')
Your uploaded zip file will be available in the mounted drive, under drive/MyDrive in the left pane (i.e. at /content/drive/MyDrive/).
To read the dataset in Google Colab, unzip the file and extract its contents into the /tmp folder using the code below.
import zipfile
import os

# Open the zip in read mode and extract the files into the /tmp folder.
with zipfile.ZipFile('/content/drive/MyDrive/train.zip', 'r') as zip_ref:
    zip_ref.extractall('/tmp')
You can check the extracted files in the /tmp/train folder in the left pane.
Finally, join the path of your dataset to use it in Colab's runtime environment.
train_dataset = os.path.join('/tmp/train/') # dataset
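From there the images can be globbed the same way as before; a small sketch, assuming the zip contained .png files:

import glob
from os import path

# /tmp is local disk on the Colab VM, so this is much faster than Drive.
file_paths = glob.glob(path.join(train_dataset, '*.png'))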

[TF1.14][TPU] Cannot use custom TFRecord dataset on Colab using TPU

I have created a TFRecord dataset file consisting of elements and their corresponding labels. I want to use it to train a model on Colab using the free TPU. I can load the TFRecord file and even run an iterator just to see the contents; however, before the beginning of the epoch it throws the following error:
UnimplementedError: From /job:worker/replica:0/task:0:
File system scheme '[local]' not implemented (file: '/content/gdrive/My Drive/data/encodeddata_inGZIP.tfrecord')
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional_1]]
As I understand it, the TFRecord file needs to be in a bucket that the TPU can access, but I don't know how to do that on Colab. How can one use a TFRecord file directly with a Colab TPU?
You need to host it on Google Cloud Storage:
All input files and the model directory must use a cloud storage bucket path (gs://bucket-name/...), and this bucket must be accessible from the TPU server. Note that all data processing and model checkpointing is performed on the TPU server, not the local machine.
As mentioned on Google's troubleshooting page: https://cloud.google.com/tpu/docs/troubleshooting#cannot_use_local_filesystem
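A minimal sketch of that workaround (the bucket name is a placeholder, and it assumes this Colab session has been authenticated against a GCS project you own):

from google.colab import auth
import tensorflow as tf

auth.authenticate_user()  # grant this session access to your GCS project

LOCAL_PATH = '/content/gdrive/My Drive/data/encodeddata_inGZIP.tfrecord'
GCS_PATH = 'gs://your-bucket/data/encodeddata_inGZIP.tfrecord'  # placeholder bucket

# tf.io.gfile understands both local and gs:// paths, so it can do the upload.
tf.io.gfile.copy(LOCAL_PATH, GCS_PATH, overwrite=True)

# The TPU workers can then read the gs:// path directly.
dataset = tf.data.TFRecordDataset(GCS_PATH, compression_type='GZIP')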
Hope this helps!

How to fix trainer package not found in AI Platform GPU-distributed training job

I'm attempting to train a TensorFlow Estimator on AI Platform. The model trains locally perfectly fine, albeit extremely slowly, but as soon as I try to run distributed GPU training on AI Platform I run into this error:
CommandException: No URLs matched: gs://path/.../trainer-0.1.tar.gz
I have my code packaged with the trainer module as recommended by Google Cloud AI Platform. Any help would be appreciated!
I was actually able to fix my issue: it appears that if I don't set up a staging bucket, the model dir where checkpoints are stored overwrites the trainer package before the worker replicas are able to download it! I'm unsure how checkpoints could even begin to be stored when the worker replicas hadn't all downloaded the trainer yet, but using a staging bucket different from my model dir fixed this.

Generate SavedModel from Tensorflow model to serve it on Google Cloud ML

I used TF Hub to retrain a model for image classification. Now I would like to serve it in the cloud. For that I need a SavedModel. The retrain.py script from TF Hub uses tf.saved_model.simple_save to generate the SavedModel after the training is done.
What confuses me is that the .pb file inside the SavedModel folder I get from that method is much smaller than the final .pb saved after the training.
simple_save is also now deprecated, and I tried to get my SavedModel after the training is done following this SO issue.
But my variables folder is empty. How can I incorporate building the SavedModel inside retrain.py to replace the simple_save method? Tips would be much appreciated.
To deploy your model to Google Cloud ML, you need a SavedModel, which can be produced with the tf.saved_model API.
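For instance, a minimal sketch of replacing simple_save with the TF 1.x SavedModelBuilder (the signature keys and the export helper are illustrative, not taken from retrain.py; wire in the actual input and output tensors of your retrained graph):

import tensorflow as tf

def export_saved_model(sess, export_dir, input_tensor, output_tensor):
    # Writes both the graph (saved_model.pb) and the variables/ folder.
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    signature = tf.saved_model.signature_def_utils.predict_signature_def(
        inputs={'image': input_tensor},
        outputs={'prediction': output_tensor})
    builder.add_meta_graph_and_variables(
        sess,
        [tf.saved_model.tag_constants.SERVING],
        signature_def_map={
            tf.saved_model.signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature
        })
    builder.save()

Because add_meta_graph_and_variables is called with a live session, the variables folder is populated rather than left empty.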
Below are the steps for hosting your trained model in the cloud with Cloud ML Engine.
Set a name for the Cloud Storage bucket that will hold your saved model: BUCKET_NAME="your_bucket_name"
Select a region for your bucket and set a REGION environment variable: REGION=us-central1
Create the bucket: gsutil mb -l $REGION gs://$BUCKET_NAME
Upload the SavedModel:
SAVED_MODEL_DIR=$(ls ./your-export-dir-base | tail -1)
gsutil cp -r $SAVED_MODEL_DIR gs://$BUCKET_NAME
Create a Cloud ML Engine model resource and model version.
Also, for your question on incorporating the SavedModel inside retrain.py: you need to pass the saved model as an argument to the --tfhub_module flag, as below:
python retrain.py --image_dir C:\...\code\<give the path here> --tfhub_module C:\<give the path to the saved model directory>

How do you install modules within sagemaker training jobs?

I don't think I'm asking this question right, but I have a Jupyter notebook that launches a TensorFlow training job with a Python training script I wrote.
That training script requires certain modules, and it seems my SageMaker training job is failing because some of those modules don't exist.
How can I ensure that my training job script has all the modules it needs?
Edit
An example of one of these modules is keras.
The odd thing is, I can import keras in the Jupyter notebook, but when that import statement is in my training script I get the No module named keras error.
If you want to install multiple packages, one way is to upgrade to SageMaker Python SDK v2. With it, you can create a requirements.txt in the source directory you pass to the estimator and run the training; SageMaker will automatically take care of the installation.
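A minimal sketch of the v2 setup (role ARN, bucket, and versions are placeholders; src/ contains both train.py and requirements.txt):

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point='train.py',
    source_dir='src',  # requirements.txt in here is installed automatically
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder role
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    framework_version='2.4.1',
    py_version='py37',
)
estimator.fit('s3://my-bucket/training-data')  # placeholder S3 path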
If you want to stay on the v1 SDK, you can add the following snippet to your entry_point script:

import subprocess
import sys

def install(package):
    # pip's -q (quiet) flag goes after `install`, not before -m
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", package])

install('keras')
The training script runs within a Docker container which does not have the dependency installed; the Jupyter notebook, on the other hand, has keras pre-installed.
An easy way to do this is to have a requirements.txt file with all the requirements and then pass it on when creating your model.

env = {
    'SAGEMAKER_REQUIREMENTS': 'requirements.txt',  # path relative to `source_dir` below
}

sagemaker_model = TensorFlowModel(
    model_data='s3://mybucket/modelTarFile',
    role=role,
    entry_point='entry.py',
    code_location='s3://mybucket/runtime-code/',
    source_dir='src',
    env=env,
    name='model_name',
    sagemaker_session=sagemaker_session,
)
You can upload your requirements.txt file to an S3 bucket that is accessible by SageMaker, download the file to the working directory of the container using boto3, and install the libraries from requirements.txt in the entry file.

import os
import boto3

s3 = boto3.client('s3')
s3.download_file('BUCKET_NAME', 'OBJECT_NAME', '/opt/ml/code/requirements.txt')
# os.command does not exist; os.system runs the shell command
os.system('pip install -r /opt/ml/code/requirements.txt')
The other way you can do it is by building your own container using the bring-your-own-algorithm option provided by AWS.
Ref-links:
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
The EstimatorBase class (and the TensorFlow class) accepts the parameter dependencies, which you can use as follows to pass your requirements.txt:
estimator = TensorFlow(
    dependencies=['requirements.txt'],  # copies this file
)

e.g.

estimator = TensorFlow(
    entry_point='src/train.py',
    dependencies=['requirements.txt'],  # copies this file
)

or

estimator = TensorFlow(
    source_dir='src',        # this copies the entire src folder
    entry_point='train.py',  # when using source_dir, entry_point has to be directly under that dir
    dependencies=['requirements.txt'],  # copies this file
)
This copies the requirements.txt file into your sourcedir.tar.gz along with the training code.
This may only work on newer image versions. I read that in older versions you may need to put the requirements.txt file in the same folder as your training code.
If this doesn't work, you can use pip download to download your dependencies defined in requirements.txt locally, then use the dependencies parameter to specify the folder to which you downloaded your dependencies.
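A rough sketch of that fallback (the deps/ folder name is arbitrary; run the pip download step locally before submitting the job):

# locally, beforehand:  pip download -r requirements.txt -d deps/
estimator = TensorFlow(
    entry_point='train.py',
    dependencies=['deps'],  # copies the pre-downloaded packages alongside the code
)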
Another option is to add the following in your entry_point .py file:

import os

if __name__ == "__main__":
    os.system('pip install mymodule')
    import mymodule
    # rest of code goes here
This worked for me for simple modules such as pyparsing, but I think with keras you are better off just using a TensorFlow container that has keras preinstalled, as mentioned above.
The environment on your notebook instance is separate from the environment of your training job on SageMaker, unless it is running in local mode.
If you're using a custom docker image, then most likely your docker image doesn't have Keras installed.
If you are using the SageMaker predefined TensorFlow container, it is most likely invoked through code like the following (see https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/estimator.py#L170):

TensorFlow(
    entry_point='training_code.py',
    blah,
    blah,
)
Then you will need to install your dependencies within that container. There are currently two modes for training TensorFlow on SageMaker: "framework" mode and "script" mode.
If training through "framework" mode, which is only available with 1.12 and below, then you will be limited to using a keras_model_fn defined here:
https://github.com/aws/sagemaker-python-sdk/tree/v1.12.0/src/sagemaker/tensorflow#preparing-the-tensorflow-training-script
Installing your dependencies would be done by passing in a requirements.txt.
In "script" mode, which was introduced with TensorFlow 1.11 and above:
https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/tensorflow#training-with-tensorflow
requirements.txt is not supported in "script" mode; instead, it is recommended to install your dependencies within your user script, i.e. the Python file that contains all of your Keras code.
Please let me know if there is anything I can clarify.
For examples:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_script_mode_quickstart
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/tensorflow_iris_dnn_classifier_using_estimators