Running distributed Tensorflow on Google Cloud ML engine ClusterSpec - tensorflow-serving

I am trying to run a large distributed TensorFlow model on Google Cloud's ML Engine and am having trouble understanding what should go in tf.train.ClusterSpec.
When you run a job on Google Cloud you can select the scale tier from BASIC, STANDARD_1, PREMIUM_1, BASIC_GPU or CUSTOM, each giving you access to a different type of cluster. However, I can't find the names or addresses of the machines in these clusters.

Please take a look at the documentation and sample here. You should set ClusterSpec using the environment variable TF_CONFIG; e.g.
tf_config = os.environ.get('TF_CONFIG')
# If TF_CONFIG is not available, run locally.
if not tf_config:
    return run('', True, *args, **kwargs)
tf_config_json = json.loads(tf_config)
cluster = tf_config_json.get('cluster')
...
cluster_spec = tf.train.ClusterSpec(cluster)
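To make the shape of TF_CONFIG concrete, here is a minimal sketch of parsing it into the cluster dictionary plus this process's job name and task index. The helper name parse_tf_config and the host names in the example value are assumptions for illustration only; on a real job the service sets TF_CONFIG for you.

```python
import json
import os


def parse_tf_config(environ=os.environ):
    """Return (cluster, job_name, task_index) parsed from TF_CONFIG.

    Returns (None, None, None) when TF_CONFIG is unset, i.e. a local run.
    """
    raw = environ.get('TF_CONFIG')
    if not raw:
        return None, None, None
    config = json.loads(raw)
    cluster = config.get('cluster')  # job name -> list of "host:port" strings
    task = config.get('task', {})    # identifies this process within the cluster
    return cluster, task.get('type'), task.get('index')


# Hypothetical TF_CONFIG value, for illustration only.
example_env = {
    'TF_CONFIG': json.dumps({
        'cluster': {
            'master': ['master-0:2222'],
            'worker': ['worker-0:2222', 'worker-1:2222'],
            'ps': ['ps-0:2222'],
        },
        'task': {'type': 'worker', 'index': 1},
    })
}
cluster, job_name, task_index = parse_tf_config(example_env)
print(cluster[job_name][task_index])
```

The cluster dictionary is what gets passed to tf.train.ClusterSpec, while the job name and task index tell each process which entry in that cluster it is.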

Related

Problem with connecting google Colab with google Cloud TPUs

I have this code, which is based on the T5 notebook (https://colab.research.google.com/github/google-research/text-to-text-transfer-transformer/blob/master/notebooks/t5-trivia.ipynb):
FINETUNE_STEPS = 3000  #@param {type: "integer"}
model.finetune(
    mixture_or_task_name="text_diacritization_short",
    pretrained_model_dir=PRETRAINED_DIR,
    finetune_steps=FINETUNE_STEPS
)
My code was working fine on 8 August; then something changed, resulting in this error.
These two lines also appeared when my model worked, so I don't think they are the problem:
INFO:root:system_path_file_exists:gs://my_bucket/my_file/models/small/operative_config.gin
ERROR:root:Path not found: gs://my_bucket/my_file/models/small/operative_config.gin
The rest of the error:
From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/training_util.py:399: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
WARNING:absl:Using an uncached FunctionDataset for training is not recommended since it often results in insufficient shuffling on restarts, resulting in overfitting. It is highly recommended that you cache this task before training with it or use a data source that supports lower-level shuffling (e.g., FileDataSource).
SimdMeshImpl ignoring devices ['', '', '', '', '', '', '', '']
Using default tf glorot_uniform_initializer for variable encoder/block_000/layer_000/SelfAttention/relative_attention_bias The initialzer will guess the input and output dimensions based on dimension order.
Using default tf glorot_uniform_initializer for variable decoder/block_000/layer_000/SelfAttention/relative_attention_bias The initialzer will guess the input and output dimensions based on dimension order.
From /usr/local/lib/python3.7/dist-packages/tensorflow/python/training/saver.py:1161: get_checkpoint_mtimes (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file utilities to get mtimes.
From /usr/local/lib/python3.7/dist-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py:758: Variable.load (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Prefer Variable.assign which has equivalent behavior in 2.X.
I switched both the Google Cloud account and the Colab notebook to a completely new Gmail account. I think the problem is that something was updated in Google Colab regarding connecting to Google Cloud TPUs.
Also, I can connect to my bucket normally using this code:
BASE_DIR = "gs://my_bucket/my_file"  #@param { type: "string" }
if not BASE_DIR or BASE_DIR == "gs://":
    raise ValueError("You must enter a BASE_DIR.")
DATA_DIR = os.path.join(BASE_DIR, "data")
FINETUNE_MODELS_DIR = os.path.join(BASE_DIR, "models")
ON_CLOUD = True
if ON_CLOUD:
    print("Setting up GCS access...")
    import tensorflow_gcs_config
    from google.colab import auth
    # Set credentials for GCS reading/writing from Colab and TPU.
    TPU_TOPOLOGY = "v2-8"
    try:
        tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
        TPU_ADDRESS = tpu.get_master()
        print('Running on TPU:', TPU_ADDRESS)
    except ValueError:
        raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
    auth.authenticate_user()
    tf.enable_eager_execution()
    tf.config.experimental_connect_to_host(TPU_ADDRESS)
    tensorflow_gcs_config.configure_gcs_from_colab_auth()
tf.disable_v2_behavior()
# Improve logging.
from contextlib import contextmanager
import logging as py_logging
if ON_CLOUD:
    tf.get_logger().propagate = False
    py_logging.root.setLevel('INFO')
@contextmanager
def tf_verbosity_level(level):
    og_level = tf.logging.get_verbosity()
    tf.logging.set_verbosity(level)
    yield
    tf.logging.set_verbosity(og_level)
It would be great if someone could help. I have been looking into this issue for a week and have found nothing. Are there any changes to how Google Colab works that I am not aware of?
Thanks in advance.

What do you use to access CSV data on S3 and other object storage providers as a PyTorch Dataset?

My dataset is stored as a collection of CSV files in an Amazon Web Services (AWS) Simple Storage Service (S3) bucket. I'd like to train a PyTorch model based on this data but the built-in Dataset classes do not provide native support for object storage services like S3 or Google Cloud Storage (GCS), Azure Blob storage, and such. I checked the PyTorch documentation here https://pytorch.org/docs/stable/data.html# about the available Dataset classes and it comes up short when it comes to public cloud object storage support.
It looks like I have to create my own custom Dataset according to the following instructions: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class but the effort seems overwhelming: I need to figure out how to download data from the object storage to the local node, parse the CSV files to read them into PyTorch tensors, and then deal with the possibility of running out of disk space, since my dataset is hundreds of GB.
Since PyTorch models are trained using gradient descent and I only need to store just a small batch of data (less than 1GB) in memory at once, is there a custom dataset implementation that can help?
Check out ObjectStorage Dataset, which supports object storage services like S3 and GCS: osds.readthedocs.io/en/latest/gcs.html
You can run
pip install osds
to install it and then point it at your S3 bucket to instantiate the PyTorch Dataset and DataLoader using something like
from osds.utils import ObjectStorageDataset
from torch.utils.data import DataLoader
ds = ObjectStorageDataset(f"gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv",
                          storage_options = {'anon' : False },
                          batch_size = 32768,
                          worker = 4,
                          eager_load_batches = False)
dl = DataLoader(ds, batch_size=None)
where you use your S3 location path instead of gcs://gs://cloud-training-demos/taxifare/large/taxi-train*.csv. So your glob for S3 would be something like s3://<bucket name>/<object path>/*.csv depending on the bucket and the bucket directory where you store your CSV objects for the dataset.
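If you would rather see what a hand-rolled version involves, the core of it is reading CSV rows in fixed-size batches so that only one batch is in memory at a time. Below is a minimal sketch using only the standard library. The function name iter_csv_batches is hypothetical, and the sketch assumes every column is numeric and that there is no header row; the in-memory string stands in for a text stream opened from S3.

```python
import csv
import io
from itertools import islice


def iter_csv_batches(text_stream, batch_size):
    """Yield lists of up to batch_size rows, each row a list of floats.

    Rows are pulled lazily from the stream, so only one batch is held
    in memory at a time.
    """
    reader = csv.reader(text_stream)
    while True:
        batch = [[float(v) for v in row] for row in islice(reader, batch_size)]
        if not batch:
            return
        yield batch


# Stand-in for a CSV object opened from S3; any text file-like object works.
fake_object = io.StringIO("1.0,2.0\n3.0,4.0\n5.0,6.0\n")
batches = list(iter_csv_batches(fake_object, batch_size=2))
print(len(batches))
```

A generator like this could be wrapped in a torch.utils.data.IterableDataset, converting each batch to a tensor as it is yielded.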

Having issues reading S3 bucket when transitioning a tensorflow model from local machine to AWS SageMaker

When testing on a local machine in Python I would normally use the following to read a training set with sub-directories of all the classes and files/class:
train_path = r"C:\temp\coins\PCGS - Gold\train"
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0','1','2','3' etc...], batch_size=32)
Found 4100 images belonging to 22 classes.
On an AWS SageMaker Jupyter notebook, however, I am now pulling the files from an S3 bucket. I tried the following:
bucket = "coinpath"
train_path = 's3://{}/{}/train'.format(bucket, "v1") #note that the directory structure is coinpath/v1/train where coinpath is the bucket
train_batches = ImageDataGenerator().flow_from_directory(train_path, target_size=(100,100), classes=['0','1','2','3' etc...], batch_size=32)
but I get:
Found 0 images belonging to 22 classes.
Looking for some guidance on the right way to pull training data from S3.
From "Ideal way to read data in bucket stored batches of data for Keras ML training in Google Cloud Platform?": "ImageDataGenerator.flow_from_directory() currently does not allow you to stream data directly from a GCS bucket."
I had to download the images from S3 first. This is best for latency reasons as well.
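The download step can be sketched roughly as follows, using boto3 to copy every object under a prefix to local disk while keeping the per-class subdirectories that flow_from_directory expects. The helper names and the local directory /tmp/train are assumptions for illustration; the bucket and prefix come from the question.

```python
import os


def local_path_for_key(key, prefix, local_dir):
    """Map an S3 key under prefix to a path under local_dir, keeping subfolders."""
    relative = key[len(prefix):].lstrip('/')
    return os.path.join(local_dir, *relative.split('/'))


def download_prefix(bucket, prefix, local_dir):
    """Download every object under s3://bucket/prefix into local_dir."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get('Contents', []):
            key = obj['Key']
            if key.endswith('/'):  # skip "folder" placeholder objects
                continue
            target = local_path_for_key(key, prefix, local_dir)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            s3.download_file(bucket, key, target)


# Usage with the bucket layout from the question (local directory is hypothetical):
# download_prefix('coinpath', 'v1/train/', '/tmp/train')
# train_batches = ImageDataGenerator().flow_from_directory('/tmp/train', ...)
```

After the download, train_path simply points at the local directory instead of the s3:// URL.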

Google Storage (gs) wrapper file input/out for Cloud ML?

Google recently announced Cloud ML, https://cloud.google.com/ml/, and it's very useful. However, one limitation is that the input/output of a TensorFlow program must support gs://.
If we use only TensorFlow APIs to read and write files, it should be OK, since these APIs support gs://.
However, if we use native file IO APIs such as open, it does not work, because they don't understand gs://.
For example:
with open(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)
This code won't work in Google Cloud ML.
However, changing all native file IO calls to TensorFlow APIs or Google Storage Python APIs is really tedious. Is there any simple way to do this? Are there any wrappers that support Google Storage (gs://) on top of native file IO?
As suggested in "Pickled scipy sparse matrix as input data?", perhaps we can use file_io.read_file_to_string('gs://...'), but this still requires significant code modification.
Do it like this:
from tensorflow.python.lib.io import file_io
with file_io.FileIO('gs://.....', mode='w+') as f:
    cPickle.dump(self.words, f)
Or you can read a pickle file like this:
file_stream = file_io.FileIO(train_file, mode='r')
x_train, y_train, x_test, y_test = pickle.load(file_stream)
One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:
vocab_file = 'vocab.pickled'
subprocess.check_call(['gsutil', '-m' , 'cp', '-r',
                       os.path.join('gs://path/to/', vocab_file), '/tmp'])
with open(os.path.join('/tmp', vocab_file), 'wb') as f:
    cPickle.dump(self.words, f)
And if you have any outputs, you can write them to local disk and gsutil rsync them. (But be careful to handle restarts correctly, because you may be put on a different machine.)
The other solution is to monkey patch open (Note: untested):
import __builtin__
from tensorflow.python.lib.io import file_io
# NB: not all modes are compatible; should handle more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
    return file_io.FileIO(name, mode)
__builtin__.open = new_open
Just be sure to do that before any module actually tries to read from GCS.
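A slightly more conservative variant of the same idea is to redirect only gs:// paths and leave every other call on the real built-in. The sketch below is written for Python 3 (builtins rather than __builtin__) and is equally untested against Cloud ML. make_gcs_aware_open is a hypothetical helper, and gcs_opener stands for any callable taking (name, mode), such as file_io.FileIO.

```python
import builtins

_original_open = builtins.open  # keep a handle on the real built-in


def make_gcs_aware_open(gcs_opener):
    """Return an open()-like function that sends gs:// paths to gcs_opener.

    gcs_opener is an assumption: any callable taking (name, mode).
    All other paths fall through to the original built-in open.
    """
    def gcs_aware_open(name, mode='r', *args, **kwargs):
        if isinstance(name, str) and name.startswith('gs://'):
            return gcs_opener(name, mode)
        return _original_open(name, mode, *args, **kwargs)
    return gcs_aware_open


# Usage (hypothetical):
# from tensorflow.python.lib.io import file_io
# builtins.open = make_gcs_aware_open(file_io.FileIO)
```

Because local paths never reach the GCS opener, the mode-compatibility caveat above only applies to gs:// files.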
apache_beam has the gcsio module, which can be used to return a standard Python file object to read/write GCS objects. You can use this object with any method that works with Python file objects. For example:
def open_local_or_gcs(path, mode):
    """Opens the given path."""
    if path.startswith('gs://'):
        try:
            return gcsio.GcsIO().open(path, mode)
        except Exception as e:  # pylint: disable=broad-except
            # Currently we retry exactly once, to work around flaky gcs calls.
            logging.error('Retrying after exception reading gcs file: %s', e)
            time.sleep(10)
            return gcsio.GcsIO().open(path, mode)
    else:
        return open(path, mode)
with open_local_or_gcs(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)

Multiple GPU training with tensorflow.slim.learning

The new learning module in tensorflow.contrib.slim looks very promising:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/slim/python/slim/learning.py
I'm trying to figure out how I can reproduce the CIFAR-10 multi-GPU example (or the ImageNet example) using this new module in a configuration where I have only a single worker node, but with several GPUs on it.
I've had some success using https://github.com/tensorflow/models/tree/master/slim/deployment
When creating the config object, you would set num_clones = [num_gpus].
For example,
config = model_deploy.DeploymentConfig(num_clones=2)
Check out the example in model_deploy.py file.