Save Keras ModelCheckpoints in Google Cloud Bucket - tensorflow

I'm working on training an LSTM network on Google Cloud Machine Learning Engine using Keras with the TensorFlow backend. I managed to deploy my model and perform a successful training job after some adjustments to gcloud and my Python script.
I then tried to make my model save checkpoints after every epoch using the Keras ModelCheckpoint callback. Running a local training job with Google Cloud works perfectly as expected: the weights are stored at the specified path after each epoch. But when I run the same job online on Google Cloud Machine Learning Engine, the weights.hdf5 does not get written to my Google Cloud Bucket. Instead I get the following error:
...
File "h5f.pyx", line 71, in h5py.h5f.open (h5py/h5f.c:1797)
IOError: Unable to open file (Unable to open file: name =
'gs://.../weights.hdf5', errno = 2, error message = 'no such file or
directory', flags = 0, o_flags = 0)
I investigated this issue and it turned out that there is no problem with the bucket itself, as the Keras TensorBoard callback works fine and writes the expected output to the same bucket. I also made sure that h5py gets included by providing it in the setup.py located at:
├── setup.py
└── trainer
├── __init__.py
├── ...
The actual include in setup.py is shown below:
# setup.py
from setuptools import setup, find_packages

setup(name='kerasLSTM',
      version='0.1',
      packages=find_packages(),
      author='Kevin Katzke',
      install_requires=['keras', 'h5py', 'simplejson'],
      zip_safe=False)
I guess the problem comes down to the fact that GCS cannot be accessed with Python's built-in open() for I/O, since TensorFlow instead provides a custom implementation:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

with file_io.FileIO("gs://...", 'w') as f:
    f.write("Hi!")
I checked how the Keras ModelCheckpoint callback implements the actual file writing, and it turns out that it uses h5py.File() for I/O:
with h5py.File(filepath, mode='w') as f:
    f.attrs['keras_version'] = str(keras_version).encode('utf8')
    f.attrs['backend'] = K.backend().encode('utf8')
    f.attrs['model_config'] = json.dumps({
        'class_name': model.__class__.__name__,
        'config': model.get_config()
    }, default=get_json_type).encode('utf8')
And as the h5py package is a Pythonic interface to the HDF5 binary data format, h5py.File() calls down into the underlying HDF5 C library as far as I can tell: source, documentation.
How can I fix this and make the ModelCheckpoint callback write to my GCS bucket? Is there a way to "monkey patch" how the HDF5 file is opened so that it uses GCS's file_io.FileIO()?

I might be a bit late on this, but for the sake of future visitors I will describe the whole process of adapting code that previously ran locally to be GoogleML-aware from the I/O point of view.
Python's standard open(file_name, mode) does not work with buckets (gs://...../file_name). You need to from tensorflow.python.lib.io import file_io and change all calls of open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter). The interface of the opened handle is the same.
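For instance, a minimal sketch of the substitution (the gs:// path is a placeholder):

from tensorflow.python.lib.io import file_io

# instead of: with open('gs://my-bucket/report.txt', 'w') as f:
with file_io.FileIO('gs://my-bucket/report.txt', mode='w') as f:
    f.write('training finished\n')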
Keras and other libraries mostly use the standard open(file_name, mode) internally, so calls into third-party libraries such as trained_model.save(file_path) will fail to store the result in the bucket. The only way to retrieve a model after the job has finished successfully is to store it locally and then move it to the bucket.
The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:
model.save(file_path)

with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        output_f.write(input_f.read())
The mode must be set to binary for both reading and writing.
When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption.
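A rough sketch of such a chunked copy, reusing the file_path and model_dir names from above (the 4 MB chunk size is an arbitrary choice):

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per read; an arbitrary choice

with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        while True:
            chunk = input_f.read(CHUNK_SIZE)
            if not chunk:
                break
            output_f.write(chunk)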
Before running a real task, I would advise running a stub that simply saves a file to a remote bucket.
This implementation, temporarily put in place of the real train_model call, should do:
import argparse
import os

from tensorflow.python.lib.io import file_io

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        help='GCS location with read/write access',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__
    job_dir = arguments.pop('job_dir')

    with file_io.FileIO(os.path.join(job_dir, "test.txt"), mode='wb+') as of:
        of.write("Test passed.")
After a successful execution you should see the file test.txt with the content "Test passed." in your bucket.

The issue can be solved with the following piece of code:
# Save Keras ModelCheckpoints locally
model.save('model.h5')

# Copy model.h5 over to Google Cloud Storage
with file_io.FileIO('model.h5', mode='r') as input_f:
    # job_dir is the gs:// output path passed to the training job
    with file_io.FileIO(os.path.join(job_dir, 'model.h5'), mode='w+') as output_f:
        output_f.write(input_f.read())
        print("Saved model.h5 to GCS")
The model.h5 file is saved on the local filesystem and then copied over to GCS. As Jochen pointed out, there is currently no easy support for writing HDF5 model checkpoints to GCS. With this hack it is possible to write the data until an easier solution is provided.

I faced a similar problem and the solution above didn't work for me. The file must be read and written in binary form, otherwise this error will be thrown:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
So the code will be
def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())
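If you want the copy to happen after every epoch rather than only once at the end, one option (a sketch, not part of the original answer; local_path and job_dir are placeholders) is to combine a local ModelCheckpoint with a LambdaCallback that calls the helper above:

from keras.callbacks import LambdaCallback, ModelCheckpoint

local_path = 'weights.hdf5'          # written to the container's local disk
job_dir = 'gs://your-bucket/output'  # placeholder GCS location

callbacks = [
    ModelCheckpoint(local_path),  # saves locally after each epoch
    LambdaCallback(on_epoch_end=lambda epoch, logs:
                   copy_file_to_gcs(job_dir, local_path)),  # then copies to GCS
]
model.fit(x_train, y_train, epochs=10, callbacks=callbacks)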

Here is my code that I wrote to save the model after each epoch.
import os
import numpy as np
import warnings

from keras.callbacks import ModelCheckpoint
from tensorflow.python.lib.io import file_io


class ModelCheckpointGC(ModelCheckpoint):
    """Taken from and modified:
    https://github.com/keras-team/keras/blob/tf-keras/keras/callbacks.py
    """

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % (self.monitor), RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(filepath, overwrite=True)
                        else:
                            if is_development():
                                self.model.save(filepath, overwrite=True)
                            else:
                                # Save locally first, then copy the file to GCS.
                                self.model.save(filepath.split("/")[-1])
                                with file_io.FileIO(filepath.split("/")[-1], mode='rb') as input_f:
                                    with file_io.FileIO(filepath, mode='wb+') as output_f:
                                        output_f.write(input_f.read())
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch, filepath))
                if self.save_weights_only:
                    self.model.save_weights(filepath, overwrite=True)
                else:
                    if is_development():
                        self.model.save(filepath, overwrite=True)
                    else:
                        # Save locally first, then copy the file to GCS.
                        self.model.save(filepath.split("/")[-1])
                        with file_io.FileIO(filepath.split("/")[-1], mode='rb') as input_f:
                            with file_io.FileIO(filepath, mode='wb+') as output_f:
                                output_f.write(input_f.read())
There is a function is_development() that checks whether we are in the local or the gcloud environment. On the local environment I set the variable LOCAL_ENV=1:
def is_development():
    """Check if the environment is local or in gcloud.

    Created the local variable in the bash profile:
        export LOCAL_ENV=1

    Returns:
        [boolean] -- True if local env
    """
    try:
        if os.environ['LOCAL_ENV'] == '1':
            return True
        else:
            return False
    except KeyError:
        return False
Then you can use it:
ModelCheckpointGC(
'gs://your_bucket/models/model.h5',
monitor='loss',
verbose=1,
save_best_only=True,
mode='min'))
I hope that helps someone and saves some time.

A hacky workaround is to save to the local filesystem and then copy it over using the TF I/O API. I added an example to the Keras example on GoogleCloudPlatform ML samples.
Basically it checks whether the target directory is a GCS path ("gs://"), forces h5py to write to the local filesystem, and then copies the file to GCS using the TF file_io API. See for example: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/task.py#L146
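The pattern boils down to something like the following sketch (the save_checkpoint helper and the paths are illustrative, not the sample's actual code; copy_file_to_gcs is a binary copy helper like the one shown in an earlier answer):

CHECKPOINT_NAME = 'weights.hdf5'  # illustrative file name

def save_checkpoint(model, job_dir):
    if job_dir.startswith('gs://'):
        # h5py cannot write to GCS directly, so write locally and copy afterwards
        model.save(CHECKPOINT_NAME)
        copy_file_to_gcs(job_dir, CHECKPOINT_NAME)
    else:
        model.save(os.path.join(job_dir, CHECKPOINT_NAME))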

For me the easiest way is to use gsutil.
model.save('model.h5')
!gsutil -m cp model.h5 gs://name-of-cloud-storage/model.h5
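Note that the ! prefix only works inside a notebook; from a plain Python script you can achieve roughly the same thing with subprocess (the bucket name is a placeholder):

import subprocess

model.save('model.h5')
subprocess.check_call(
    ['gsutil', '-m', 'cp', 'model.h5', 'gs://name-of-cloud-storage/model.h5'])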

I am not sure why this hasn't been mentioned already, but there is a solution where you don't need to add a copy function to your code.
Install gcsfuse following these steps:
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
Then mount your bucket locally:
mkdir bucket
gcsfuse <cloud_bucket_name> bucket
and then use the local directory bucket/ as the logdir of your model.
Syncing of cloud and local directory will be automated for you and your code can stay clean.
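For example, with the bucket mounted you can point ModelCheckpoint directly at the mounted directory (the paths are placeholders):

from keras.callbacks import ModelCheckpoint

# 'bucket/' is the gcsfuse mount point created above
checkpoint = ModelCheckpoint('bucket/checkpoints/weights.{epoch:02d}.hdf5')
model.fit(x_train, y_train, epochs=10, callbacks=[checkpoint])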
Hope it helps :)

Saving in the TensorFlow SavedModel format instead of HDF5 avoids the h5py problem, since the SavedModel export goes through TensorFlow's own file system layer:
tf.keras.models.save_model(model, filepath, save_format="tf")
From the docs: save_format: Either 'tf' or 'h5', indicating whether to save the model to Tensorflow SavedModel or HDF5. Defaults to 'tf' in TF 2.X, and 'h5' in TF 1.X.
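A minimal usage sketch, assuming a TF 2.x environment and a placeholder bucket path:

import tensorflow as tf

# The SavedModel writer uses TensorFlow's file system layer (tf.io.gfile),
# so a gs:// destination should work without a separate local-copy step.
tf.keras.models.save_model(model, 'gs://your-bucket/models/my_model', save_format='tf')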

Related

How to prevent Kaggle re-downloading model files each time session is ended and restarted?

I want to keep downloaded model data in a Kaggle notebook.
Here is an example Kaggle notebook of mine: https://www.kaggle.com/furkangozukara/tglobal-xl-booksum-wip3r3
Whenever the session is ended and restarted, it re-downloads all of the model data from Hugging Face.
For example, the model data is downloaded from the imported repository: https://huggingface.co/pszemraj/long-t5-tglobal-large-pubmed-3k-booksum-16384-WIP/tree/main
You can use the /kaggle/working directory, which is a persistent storage location in Kaggle's environment. Save your model files there, and they will persist across sessions.
Save:
model = ...      # download from Hugging Face the 1st time
tokenizer = ...  # download from Hugging Face the 1st time
...

import os, shutil

model_path = os.path.join('/kaggle/working', "YOUR_MODEL_NAME")
if os.path.exists(model_path):
    shutil.rmtree(model_path)
os.mkdir(model_path)

model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
Usage:
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
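Putting it together, a rough pattern for subsequent sessions (the local path is a placeholder; the repository id is the one from the question) could be:

import os
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = 'pszemraj/long-t5-tglobal-large-pubmed-3k-booksum-16384-WIP'
model_path = '/kaggle/working/YOUR_MODEL_NAME'

if os.path.exists(model_path):
    # Reuse the copy persisted in /kaggle/working
    model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
else:
    # First run: download from the Hugging Face Hub, then persist locally
    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model.save_pretrained(model_path)
    tokenizer.save_pretrained(model_path)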

Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file on S3 (error: No such file or directory)

I have a script where I want to write the training callbacks to a separate CSV file in a SageMaker custom-script Docker container. But when I try to run it in local mode, it fails with the error mentioned in the title (No such file or directory). I have a hyperparameter tuning job (HPO) to run and this keeps giving me errors, so I need to get this local-mode run working correctly before doing the HPO.
In the notebook I use the following code.
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='lstm_model.py',
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location + '/',
                          train_instance_count=1,
                          train_instance_type='local',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test')

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
In my lstm_model.py script the following code is used.
lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)

regressor.fit(x_train, y_train, batch_size=batch_size,
              validation_data=(x_val, y_val),
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger])
I tried creating the file beforehand as shown below using the TensorFlow backend, but it doesn't create a file. (K: TensorFlow backend, tf: TensorFlow)
filename = tf.Variable(lgdir , tf.string)
content = tf.Variable("", tf.string)
sess = K.get_session()
tf.io.write_file(filename, content)
I can't use any other packages like pandas to create the file, as the TensorFlow Docker container in SageMaker for custom scripts doesn't provide them; only a limited set of packages is available.
Is there a way I can write the CSV file to the S3 bucket location before the fit method tries to write the callback? Or is that even the right solution to the problem? I am not sure.
If you can suggest other ways to get the callbacks, I would accept that answer as well, as long as it is worth the effort.
This Docker image really narrows the scope.
Well, for starters, you can always make your own Docker image using the TensorFlow image as a base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern:
# Downloads the TensorFlow library used to run the Python script
# (use the equivalent image tag for your TF version)
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container compatible with Amazon SageMaker
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining open-source agility
# (here you would install pandas instead)
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern created by the Tensorflow estimator
# (here you could copy over a callbacks csv)
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (here you would instead launch lstm_model.py)
ENV SAGEMAKER_PROGRAM wandb_setup.sh
I believe you are looking for a pattern similar to this, but I prefer to log all of my model data using Weights and Biases. They're a little out of date on their SageMaker integration, but I'm actually in the midst of writing an updated tutorial for them. It should certainly be finished this month and include logging and comparing runs from hyperparameter tuning jobs.

How do I create a Sagemaker training job with my own Tensorflow code without having to build a container?

I'm trying to define a Sagemaker Training Job with an existing Python class. To my understanding, I could create my own container but would rather not deal with container management.
When choosing "Algorithm Source" there is the option of "Your own algorithm source" but nothing is listed under resources. Where does this come from?
I know I could do this through a notebook, but I really want this defined in a job that can be invoked through an endpoint.
As Bruno has said, you will have to use a container somewhere, but you can use an existing container to run your own custom TensorFlow code.
There is a good example in the SageMaker GitHub repository of how to do this.
The way this works is that you modify your code to have an entry point which takes argparse command-line arguments, and then you point a SageMaker TensorFlow estimator at that entry point. When you call fit on the SageMaker estimator, it will download the TensorFlow container and run your custom code inside it.
So you start off with your own custom code that looks something like this:
# my_custom_code.py
import tensorflow as tf
import numpy as np


def build_net():
    # single fully connected
    image_place = tf.placeholder(tf.float32, [None, 28 * 28])
    label_place = tf.placeholder(tf.int32, [None, ])
    net = tf.layers.dense(image_place, units=1024, activation=tf.nn.relu)
    net = tf.layers.dense(net, units=10, activation=None)
    return image_place, label_place, net


def process_data():
    # load
    (x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
    # center
    x_train = x_train / 255.0
    m = x_train.mean()
    x_train = x_train - m
    # convert to right types
    x_train = x_train.astype(np.float32)
    y_train = y_train.astype(np.int32)
    # reshape so flat
    x_train = np.reshape(x_train, [-1, 28 * 28])
    return x_train, y_train


def train_model(init_learn, epochs):
    image_p, label_p, logit = build_net()
    x_train, y_train = process_data()
    # integer class labels, so use the sparse cross-entropy and reduce to a scalar loss
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits=logit,
            labels=label_p))
    optimiser = tf.train.AdamOptimizer(init_learn)
    train_step = optimiser.minimize(loss)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(epochs):
            sess.run(train_step, feed_dict={image_p: x_train, label_p: y_train})


if __name__ == '__main__':
    train_model(0.001, 10)
To make it work with SageMaker, we need to create a command-line entry point, which allows SageMaker to run our code in the container it will download for us.
# entry.py
import argparse

from my_custom_code import train_model

if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument(
        '--model_dir',
        type=str)
    parser.add_argument(
        '--init_learn',
        type=float)
    parser.add_argument(
        '--epochs',
        type=int)

    args = parser.parse_args()
    train_model(args.init_learn, args.epochs)
Apart from specifying the arguments my function needs, we also need to provide a model_dir argument. This is always required, and is an S3 location where the model artifacts will be saved when the training job completes. Note that you don't need to specify what this value is (though you can), as SageMaker will provide a default S3 location for you.
So we have modified our code; now we need to actually run it on SageMaker. Go to the AWS console and fire up a small instance from SageMaker. Download your custom code to the instance, and then create a Jupyter notebook as follows:
# sagemaker_run.ipynb
import sagemaker
from sagemaker.tensorflow import TensorFlow

hyperparameters = {
    'epochs': 10,
    'init_learn': 0.001}

role = sagemaker.get_execution_role()
source_dir = '/path/to/folder/with/my/code/on/instance'

estimator = TensorFlow(
    entry_point='entry.py',
    source_dir=source_dir,
    train_instance_type='ml.t2.medium',
    train_instance_count=1,
    hyperparameters=hyperparameters,
    role=role,
    py_version='py3',
    framework_version='1.12.0',
    script_mode=True)

estimator.fit()
Running the above will:
Spin up an ml.t2.medium instance
Download the TensorFlow 1.12.0 container to the instance
Download any data we specify in fit to the newly created instance (in this case nothing)
Run our code on the instance
Upload the model artifacts to model_dir
And that is pretty much it. There is of course a lot not mentioned here but you can:
Download training/testing data from s3
Save checkpoint files, and tensorboard files during training and upload them to s3
The best resource I found was the example I shared but here are all the things I was looking at to get this working:
example code again
documentation
explanation of environment variables
I believe this is not possible, as you can see from this part of the SageMaker documentation. A container is needed to provide the capability to run any language and framework.
The algorithms that are listed during training job creation are the algorithms you can create in SageMaker -> Training -> Algorithms. But it's necessary to define a container, which is a specification of how you do training and predictions. Even if you don't build a container, you will refer to an existing one (using a built-in algorithm) or use an algorithm from the marketplace which someone else built using an image.
I believe you could build an image that meets your needs, starting from an existing one.
After you build the image, you can easily use it to automate your training/prediction jobs from Lambda. Here is an example.
Also, you can provide as many input channels to your container as you need to load data; in theory, you could pass a channel that refers to a script you want to load when your container starts. That is just an idea I've had, and depending on your scenario it could be worth a test. Normally, you would have an image that you can customize during the docker build process, so if you have several different scripts, you can create only one image and just parameterize it to use a custom script.
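As a rough illustration of passing multiple input channels to fit (the channel names and S3 paths are placeholders, not from the answer):

# Each key becomes an input channel; SageMaker downloads the S3 contents to
# /opt/ml/input/data/<channel_name>/ inside the training container.
estimator.fit({
    'training': 's3://my-bucket/data/train',
    'scripts': 's3://my-bucket/code/custom_scripts',
})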
Here you can find a custom image that uses Tensorflow.
Here are listed a lot of examples of building different containers for several frameworks, including TensorFlow.
I hope it helps, let me know if you need more info.
Regards.
I am not sure if this helps you, but you can make use of the TensorFlow estimator, which is like a built-in container from AWS. You need a training script and a requirements.txt file containing the dependencies you may need. You can follow this link for more information: Sagemaker TensorFlow estimators documentation.

How to read a file during training in AWS SageMaker?

I'm trying to train a custom TensorFlow model using AWS SageMaker. Thus, in the model_fn method that I have to provide, I want to be able to read an external file. I've uploaded the file to S3 and try to read it like below:
BUCKET_PATH = 's3://<bucket_name>/data/<prefix>/'

def model_fn(features, labels, mode, params):
    # Load vocabulary
    vocab_path = os.path.join(BUCKET_PATH, 'vocab.pkl')
    with open(vocab_path, 'rb') as f:
        vocab = pickle.load(f)
    n_vocab = len(vocab)
    ...
I get an IOError: [Errno 2] No such file or directory
How can I read this file during training?
I don't think pickle.load can read directly from an S3 bucket. You can either keep the data on the Python notebook path or download it using the boto3 client.
Moreover, you'd probably not want to download it in model_fn, since that would be called for each epoch. Generally, data is loaded and prepared in the train_input_fn.
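A minimal sketch of the boto3 approach (bucket and key names are placeholders); the file is downloaded once, outside model_fn, and then read from local disk:

import pickle
import boto3

s3 = boto3.client('s3')
# Download once, e.g. at the top of the training script, not inside model_fn
s3.download_file('<bucket_name>', 'data/<prefix>/vocab.pkl', '/tmp/vocab.pkl')

with open('/tmp/vocab.pkl', 'rb') as f:
    vocab = pickle.load(f)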

How to create a SavedModel from a TensorFlow checkpoint or model?

I was given a TensorFlow checkpoint and also an exported model, but to serve a model using Google ML Cloud, I need a saved_model.pbtxt file. It seems that I need to load the checkpoint and use SavedModelBuilder but SavedModelBuilder wants a dictionary of the names of the input and output nodes.
My question is, given the checkpoint or the exported model (below), how can I find the names of the nodes needed to generate the pbtxt file I need to serve the model via Google's ML Cloud service?
checkpoint
export.data-00000-of-00001
export.index
export.meta
options.json
The export.meta should be a MetaGraphDef proto. So you should be able to parse the proto to get the graph. You can then search through the nodes to find the node of interest.
Something like:
import argparse
import logging

from tensorflow.core.protobuf import meta_graph_pb2

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='Argument parser.')
    parser.add_argument('--path',
                        required=True,
                        help='The path to the metadata graph file.')
    args = parser.parse_args()

    # The .meta file is a binary protocol buffer, so open it in binary mode.
    with open(args.path, 'rb') as hf:
        graph = meta_graph_pb2.MetaGraphDef.FromString(hf.read())

    print("graph: \n{0}".format(graph))
I think you should also be able to point TensorBoard at the directory containing that file; TensorBoard will render the graph, and you can use it to identify the names of the input/output nodes.