How to prevent Kaggle from re-downloading model files each time the session is ended and restarted? - kaggle

I want to keep downloaded model data in a Kaggle notebook.
Here is an example Kaggle notebook of mine: https://www.kaggle.com/furkangozukara/tglobal-xl-booksum-wip3r3
Whenever the session is ended and restarted, it re-downloads all of the model data from Hugging Face.
For example, on each restart the notebook re-downloads the model data from the imported repository: https://huggingface.co/pszemraj/long-t5-tglobal-large-pubmed-3k-booksum-16384-WIP/tree/main

You can use the /kaggle/working directory, which is a persistent storage location in Kaggle's environment. Save your model files there, and they will persist across sessions.
Save:
import os, shutil
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Download from Hugging Face the first time
model = AutoModelForSeq2SeqLM.from_pretrained("HF_REPO_ID")
tokenizer = AutoTokenizer.from_pretrained("HF_REPO_ID")

# Save a copy under /kaggle/working so it survives a restart
model_path = os.path.join('/kaggle/working', "YOUR_MODEL_NAME")
if os.path.exists(model_path):
    shutil.rmtree(model_path)
os.mkdir(model_path)
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)
Usage:
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
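Putting the two together, a minimal sketch of the load-or-download logic (the local folder name below is a placeholder, not from the original notebook):

import os
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

hub_name = "pszemraj/long-t5-tglobal-large-pubmed-3k-booksum-16384-WIP"
local_path = os.path.join('/kaggle/working', 'long-t5-booksum')

if os.path.exists(local_path):
    # Reuse the copy saved in a previous session
    model = AutoModelForSeq2SeqLM.from_pretrained(local_path)
    tokenizer = AutoTokenizer.from_pretrained(local_path)
else:
    # First run: download from Hugging Face, then persist
    model = AutoModelForSeq2SeqLM.from_pretrained(hub_name)
    tokenizer = AutoTokenizer.from_pretrained(hub_name)
    model.save_pretrained(local_path)
    tokenizer.save_pretrained(local_path)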

Related

Can't use Keras CSVLogger callbacks in SageMaker script mode. It fails to write the log file on S3 (error - No such file or directory)

I have this script where I want to write the callbacks to a separate CSV file in a SageMaker custom-script Docker container. But when I try to run it in local mode, it fails with the error in the title. I have a hyperparameter tuning job (HPO) to run, and this keeps giving me errors. I need to get this local-mode run working correctly before doing the HPO.
In the notebook I use the following code.
from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='lstm_model.py',
                          role=role,
                          code_location=custom_code_upload_location,
                          output_path=model_artifact_location + '/',
                          train_instance_count=1,
                          train_instance_type='local',
                          framework_version='1.12',
                          py_version='py3',
                          script_mode=True,
                          hyperparameters={'epochs': 1},
                          base_job_name='hpo-lstm-local-test')

tf_estimator.fit({'training': training_input_path, 'validation': validation_input_path})
In my lstm_model.py script the following code is used.
lgdir = os.path.join(model_dir, 'callbacks_log.csv')
csv_logger = CSVLogger(lgdir, append=True)
regressor.fit(x_train, y_train,
              batch_size=batch_size,
              validation_data=(x_val, y_val),
              epochs=epochs,
              verbose=2,
              callbacks=[csv_logger])
I tried creating the file beforehand, as shown below, using the TensorFlow backend, but it doesn't create a file. (K: the TensorFlow backend, tf: tensorflow)
filename = tf.Variable(lgdir, tf.string)
content = tf.Variable("", tf.string)
sess = K.get_session()
tf.io.write_file(filename, content)
I can't use any other packages like pandas to create the file, as the TensorFlow Docker container in SageMaker for custom scripts doesn't provide them; only a limited set of packages is available.
Is there a way I can write the CSV file to the S3 bucket location before the fit method tries to write the callback output? Or is that even the right solution to the problem? I am not sure.
If you can suggest other ways to get the callbacks out, I would accept that answer as well, as long as it is worth the effort.
This Docker image really narrows the scope.
Well, for starters, you can always make your own Docker image using the TensorFlow image as a base. I work in TensorFlow 2.0, so this will be slightly different for you, but here is an example of my image pattern:
# Downloads the TensorFlow library used to run the Python script
# (you would use the equivalent image for your TF version)
FROM tensorflow/tensorflow:2.0.0a0

# Contains the common functionality necessary to create a container compatible with Amazon SageMaker
RUN pip install sagemaker-containers -q

# Wandb allows us to customize and centralize logging while maintaining open-source agility
# (here you would install pandas instead)
RUN pip install wandb -q

# Copies the training code inside the container to the design pattern created by the TensorFlow estimator
# (here you could copy over a callbacks csv)
COPY mnist-2.py /opt/ml/code/mnist-2.py
COPY callbacks.py /opt/ml/code/callbacks.py
COPY wandb_setup.sh /opt/ml/code/wandb_setup.sh

# Set the login script as the entry point
# (here you would instead launch lstm_model.py)
ENV SAGEMAKER_PROGRAM wandb_setup.sh
I believe you are looking for a pattern similar to this, but I prefer to log all of my model data using Weights and Biases. They're a little out of date on their SageMaker integration, but I'm actually in the midst of writing an updated tutorial for them. It should be finished this month and include logging and comparing runs from hyperparameter tuning jobs.
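For the CSV logging itself, a lighter-weight sketch (assuming the standard script-mode container, which sets SM_MODEL_DIR to /opt/ml/model and uploads that directory to S3 when training finishes) would be to point the CSVLogger at the model directory rather than writing to S3 directly:

import os
from keras.callbacks import CSVLogger

# SageMaker script mode sets SM_MODEL_DIR; everything written there is
# uploaded to S3 after the training job completes (assumption: standard
# script-mode container behavior)
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
csv_logger = CSVLogger(os.path.join(model_dir, 'callbacks_log.csv'), append=True)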

Saving a Keras/Sklearn model in Python and loading the saved model in tensorflow.js

I have a trained sklearn SVM model in .pkl format and a Keras .h5 model. Can I load these models using tensorflow.js in a browser?
I do most of my coding in Python and am not sure how to work with tensorflow.js.
My model saving code looks like this:
from sklearn.externals import joblib

# sklearn: save and reload the SVM with joblib
joblib.dump(svc, 'model.pkl')
model = joblib.load('model.pkl')
prediction = model.predict(X_test)
# ------------------------------------------------------------------
# Keras: save and reload the model as HDF5
from keras.models import load_model
model.save('model.h5')
model = load_model('model.h5')
In order to deploy your model with tensorflow-js, you need to use the tensorflowjs_converter, so you also need to install the tensorflowjs dependency.
You can do that in python via pip install tensorflowjs.
Next, you convert your trained model with the following command, adjusting the paths to your own names: tensorflowjs_converter --input_format=keras /tmp/model.h5 /tmp/tfjs_model, where the last path is the output directory of the conversion.
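If you would rather stay in Python, the tensorflowjs package also exposes the converter programmatically; a minimal sketch (paths are placeholders):

import tensorflowjs as tfjs
from keras.models import load_model

model = load_model('model.h5')
tfjs.converters.save_keras_model(model, '/tmp/tfjs_model')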
Note that, after the conversion you will get a model.json (architecture of your model) and a list of N shards (weights split in N shards).
Then, in JavaScript, you need to use the function tf.loadLayersModel(MODEL_URL), where MODEL_URL is the URL pointing to your model.json. Ensure that the shards are located in the same place as the model.json.
Since this is an asynchronous operation (you do not want your web page to be blocked while your model is loading), you need to use the JavaScript await keyword; hence await tf.loadLayersModel(MODEL_URL).
Please have a look at the following link to see an example: https://www.tensorflow.org/js/guide/conversion

Save Keras ModelCheckpoints in Google Cloud Bucket

I'm working on training an LSTM network on Google Cloud Machine Learning Engine using Keras with the TensorFlow backend. I managed to deploy my model and run a successful training job after some adjustments to gcloud and my Python script.
I then tried to make my model save checkpoints after every epoch using the Keras ModelCheckpoint callback. Running a local training job with Google Cloud works perfectly as expected: the weights are stored at the specified path after each epoch. But when I try to run the same job online on Google Cloud Machine Learning Engine, the weights.hdf5 does not get written to my Google Cloud bucket. Instead I get the following error:
...
File "h5f.pyx", line 71, in h5py.h5f.open (h5py/h5f.c:1797)
IOError: Unable to open file (Unable to open file: name =
'gs://.../weights.hdf5', errno = 2, error message = 'no such file or
directory', flags = 0, o_flags = 0)
I investigated this issue, and it turned out that there is no problem with the bucket itself, as the Keras TensorBoard callback works fine and writes the expected output to the same bucket. I also made sure that h5py gets included, by listing it in the setup.py located at:
├── setup.py
└── trainer
├── __init__.py
├── ...
The actual include in setup.py is shown below:
# setup.py
from setuptools import setup, find_packages

setup(name='kerasLSTM',
      version='0.1',
      packages=find_packages(),
      author='Kevin Katzke',
      install_requires=['keras', 'h5py', 'simplejson'],
      zip_safe=False)
I guess the problem comes down to the fact that GCS cannot be accessed with Python's open for I/O, since TensorFlow instead provides a custom implementation:
import tensorflow as tf
from tensorflow.python.lib.io import file_io

# Note: the mode must be 'w' to write
with file_io.FileIO("gs://...", 'w') as f:
    f.write("Hi!")
I checked how the Keras ModelCheckpoint callback implements the actual file writing, and it turned out that it uses h5py.File() for I/O:
with h5py.File(filepath, mode='w') as f:
    f.attrs['keras_version'] = str(keras_version).encode('utf8')
    f.attrs['backend'] = K.backend().encode('utf8')
    f.attrs['model_config'] = json.dumps({
        'class_name': model.__class__.__name__,
        'config': model.get_config()
    }, default=get_json_type).encode('utf8')
And as the h5py package is a Pythonic interface to the HDF5 binary data format, h5py.File() calls down into the underlying HDF5 C library as far as I can tell: source, documentation.
How can I fix this and make the ModelCheckpoint callback write to my GCS bucket? Is there a way of "monkey patching" how an hdf5 file is opened, so that it uses GCS's file_io.FileIO()?
I might be a bit late on this, but for the sake of future visitors I will describe the whole process of adapting code that previously ran locally to be GoogleML-aware from the I/O point of view.
Python's standard open(file_name, mode) does not work with buckets (gs://...../file_name). You need to from tensorflow.python.lib.io import file_io and change all calls of open(file_name, mode) to file_io.FileIO(file_name, mode=mode) (note the named mode parameter). The interface of the opened handle is the same.
Keras and other libraries mostly use the standard open(file_name, mode) internally, so calls like trained_model.save(file_path) into third-party libraries will fail to store the result in the bucket. The only way to retrieve a model after the job has finished successfully is to store it locally and then move it to the bucket.
The code below is quite inefficient, because it loads the whole model at once and then dumps it to the bucket, but it worked for me for relatively small models:
model.save(file_path)
with file_io.FileIO(file_path, mode='rb') as input_f:
    with file_io.FileIO(os.path.join(model_dir, file_path), mode='wb+') as output_f:
        output_f.write(input_f.read())
The mode must be set to binary for both reading and writing.
When the file is relatively big, it makes sense to read and write it in chunks to decrease memory consumption.
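For example, a chunked copy could look like this (a sketch; the 1 MB chunk size is an arbitrary choice):

from tensorflow.python.lib.io import file_io

def copy_file_to_gcs_chunked(local_path, remote_path, chunk_size=1024 * 1024):
    # Stream the file in fixed-size chunks instead of loading it all at once
    with file_io.FileIO(local_path, mode='rb') as input_f:
        with file_io.FileIO(remote_path, mode='wb+') as output_f:
            while True:
                chunk = input_f.read(chunk_size)
                if not chunk:
                    break
                output_f.write(chunk)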
Before running a real task, I would advise running a stub that simply saves a file to the remote bucket.
This implementation, temporarily put in place of the real train_model call, should do:
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--job-dir',
        help='GCS location with read/write access',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__
    job_dir = arguments.pop('job_dir')

    with file_io.FileIO(os.path.join(job_dir, "test.txt"), mode='wb+') as output_f:
        output_f.write("Test passed.")
After a successful execution you should see the file test.txt with the content "Test passed." in your bucket.
The issue can be solved with the following piece of code:
# Save Keras ModelCheckpoints locally
model.save('model.h5')

# Copy model.h5 over to Google Cloud Storage
# ('gs://your-bucket' is a placeholder for your own bucket path)
with file_io.FileIO('model.h5', mode='r') as input_f:
    with file_io.FileIO('gs://your-bucket/model.h5', mode='w+') as output_f:
        output_f.write(input_f.read())
print("Saved model.h5 to GCS")
The model.h5 is saved on the local filesystem and then copied over to GCS. As Jochen pointed out, there currently is no easy support for writing HDF5 model checkpoints to GCS. With this hack it is possible to write the data until an easier solution is provided.
I faced a similar problem, and the solution above didn't work for me. The file must be read and written in binary form, otherwise this error will be thrown:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte
So the code becomes:
def copy_file_to_gcs(job_dir, file_path):
    with file_io.FileIO(file_path, mode='rb') as input_f:
        with file_io.FileIO(os.path.join(job_dir, file_path), mode='wb+') as output_f:
            output_f.write(input_f.read())
Here is the code I wrote to save the model after each epoch.
import os
import warnings
from keras.callbacks import ModelCheckpoint
from tensorflow.python.lib.io import file_io

class ModelCheckpointGC(ModelCheckpoint):
    """Taken from and modified:
    https://github.com/keras-team/keras/blob/tf-keras/keras/callbacks.py
    """

    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.epochs_since_last_save += 1
        if self.epochs_since_last_save >= self.period:
            self.epochs_since_last_save = 0
            filepath = self.filepath.format(epoch=epoch, **logs)
            if self.save_best_only:
                current = logs.get(self.monitor)
                if current is None:
                    warnings.warn('Can save best model only with %s available, '
                                  'skipping.' % self.monitor, RuntimeWarning)
                else:
                    if self.monitor_op(current, self.best):
                        if self.verbose > 0:
                            print('Epoch %05d: %s improved from %0.5f to %0.5f,'
                                  ' saving model to %s'
                                  % (epoch, self.monitor, self.best,
                                     current, filepath))
                        self.best = current
                        if self.save_weights_only:
                            self.model.save_weights(filepath, overwrite=True)
                        else:
                            if is_development():
                                self.model.save(filepath, overwrite=True)
                            else:
                                # On gcloud: save locally, then copy to the bucket
                                local_path = filepath.split("/")[-1]
                                self.model.save(local_path)
                                with file_io.FileIO(local_path, mode='rb') as input_f:
                                    with file_io.FileIO(filepath, mode='wb+') as output_f:
                                        output_f.write(input_f.read())
                    else:
                        if self.verbose > 0:
                            print('Epoch %05d: %s did not improve' %
                                  (epoch, self.monitor))
            else:
                if self.verbose > 0:
                    print('Epoch %05d: saving model to %s' % (epoch, filepath))
                if self.save_weights_only:
                    self.model.save_weights(filepath, overwrite=True)
                else:
                    if is_development():
                        self.model.save(filepath, overwrite=True)
                    else:
                        # On gcloud: save locally, then copy to the bucket
                        local_path = filepath.split("/")[-1]
                        self.model.save(local_path)
                        with file_io.FileIO(local_path, mode='rb') as input_f:
                            with file_io.FileIO(filepath, mode='wb+') as output_f:
                                output_f.write(input_f.read())
There is a function is_development() that checks whether we are in the local or the gcloud environment. In the local environment I set the variable LOCAL_ENV=1:
def is_development():
    """Check if the environment is local or in gcloud.

    Requires a variable set in the bash profile:
        export LOCAL_ENV=1

    Returns:
        [boolean] -- True if local env
    """
    try:
        return os.environ['LOCAL_ENV'] == '1'
    except KeyError:
        return False
Then you can use it:
checkpoint = ModelCheckpointGC(
    'gs://your_bucket/models/model.h5',
    monitor='loss',
    verbose=1,
    save_best_only=True,
    mode='min')
I hope that helps someone and saves some time.
A hacky workaround is to save to the local filesystem, then copy using the TF IO API. I added an example to the Keras example in the GoogleCloudPlatform ML samples.
Basically it checks whether the target directory is a GCS path ("gs://") and will force writing h5py to the local filesystem, then copy to GCS using the TF file_io API. See for example: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/keras/trainer/task.py#L146
For me the easiest way is to use gsutil.
model.save('model.h5')
!gsutil -m cp model.h5 gs://name-of-cloud-storage/model.h5
I am not sure why this has not been mentioned already, but there is a solution where you don't need to add a copy function to your code.
Install gcsfuse following these steps:
export GCSFUSE_REPO=gcsfuse-`lsb_release -c -s`
echo "deb http://packages.cloud.google.com/apt $GCSFUSE_REPO main" | sudo tee /etc/apt/sources.list.d/gcsfuse.list
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
sudo apt-get update
sudo apt-get install gcsfuse
Then mount your bucket locally:
mkdir bucket
gcsfuse <cloud_bucket_name> bucket
and then use the local directory bucket/ as the logdir of your model.
Syncing of cloud and local directory will be automated for you and your code can stay clean.
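For example (a sketch; the path is a placeholder), a Keras callback can then write checkpoints straight into the mounted directory:

from keras.callbacks import ModelCheckpoint

# bucket/ is the gcsfuse mount point created above
checkpoint = ModelCheckpoint('bucket/models/weights.{epoch:02d}.hdf5')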
Hope it helps :)
tf.keras.models.save_model(model, filepath, save_format="tf")
save_format: Either 'tf' or 'h5', indicating whether to save the model to Tensorflow SavedModel or HDF5. Defaults to 'tf' in TF 2.X, and 'h5' in TF 1.X.
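Since the SavedModel writer goes through TensorFlow's filesystem layer, saving straight to a gs:// destination is expected to work in TF 2.x; a minimal sketch (the toy model and bucket path are placeholders):

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer='adam', loss='mse')
tf.keras.models.save_model(model, 'gs://your-bucket/models/my_model', save_format='tf')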

How to create a SavedModel from a TensorFlow checkpoint or model?

I was given a TensorFlow checkpoint and also an exported model, but to serve a model using Google Cloud ML, I need a saved_model.pbtxt file. It seems that I need to load the checkpoint and use SavedModelBuilder, but SavedModelBuilder wants a dictionary of the names of the input and output nodes.
My question is, given the checkpoint or the exported model (below), how can I find the names of the nodes needed to generate the pbtxt file I need to serve the model via Google's ML Cloud service?
checkpoint
export.data-00000-of-00001
export.index
export.meta
options.json
The export.meta should be a MetaGraphDef proto, so you should be able to parse the proto to get the graph. You can then search through the nodes to find the nodes of interest.
Something like:
import argparse
from tensorflow.core.protobuf import meta_graph_pb2

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Argument parser.')
    parser.add_argument('--path',
                        required=True,
                        help='The path to the metadata graph file.')
    args = parser.parse_args()

    # The .meta file is a binary proto, so open it in binary mode
    with open(args.path, 'rb') as hf:
        graph = meta_graph_pb2.MetaGraphDef.FromString(hf.read())

    print("graph: \n{0}".format(graph))
I think you should also be able to point TensorBoard at the directory containing that file; TensorBoard will render the graph, and you can use that to identify the names of the input/output nodes.
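A sketch of getting the graph into TensorBoard (TF 1.x APIs; paths are placeholders):

import tensorflow as tf

# Import the MetaGraph, then write the graph so TensorBoard can render it
tf.train.import_meta_graph('export.meta')
with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/graph_logs', sess.graph)
    writer.close()
# Then run: tensorboard --logdir=/tmp/graph_logs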

How are weights saved in the CIFAR10 tutorial for tensorflow?

In the TensorFlow tutorial to train a network on CIFAR-10, where and how do they save the weights/parameters between running training and evaluation? I cannot see any files saved to my project directory.
Here are the links to the tutorial and the code:
https://www.tensorflow.org/versions/r0.11/tutorials/deep_cnn/index.html
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/models/image/cifar10
It saves the logs and checkpoints to the /tmp/ folder by default.
The weights are included in the checkpoint files.
As you can see in both the eval and train files, they take a checkpoint dir as a parameter.
cifar10_train.py:
tf.app.flags.DEFINE_string('train_dir', '/tmp/cifar10_train',
                           """Directory where to write event logs """
                           """and checkpoint.""")
cifar10_eval.py:
tf.app.flags.DEFINE_string('eval_dir', '/tmp/cifar10_eval',
                           """Directory where to write event logs.""")
tf.app.flags.DEFINE_string('eval_data', 'test',
                           """Either 'test' or 'train_eval'.""")
tf.app.flags.DEFINE_string('checkpoint_dir', '/tmp/cifar10_train',
                           """Directory where to read model checkpoints.""")
You can call those scripts with custom values for those flags. For my project using Inception I had to change them, since the main hard drive did not have enough space for the bottlenecks created by Inception.
It is good practice to set those values explicitly, since the /tmp/ folder is not persistent and you might otherwise lose your training data.
The following command saves the training data into a custom folder:
python cifar10_train.py --train_dir="/home/username/train_folder"
and then, to evaluate:
python cifar10_eval.py --checkpoint_dir="/home/username/train_folder"
It also applies to the other examples.
Let's assume you're running cifar10_train. Saving happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L122
And the default location is defined in this line (it's "/tmp/cifar10_train"):
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_train.py#L51
In cifar10_eval, restoring the weights happens on this line:
https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/models/image/cifar10/cifar10_eval.py#L75
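The saving itself is the standard tf.train.Saver pattern; roughly (sketched from the linked r0.11 code):

saver = tf.train.Saver(tf.all_variables())
# ... inside the training loop ...
if step % 1000 == 0 or (step + 1) == FLAGS.max_steps:
    checkpoint_path = os.path.join(FLAGS.train_dir, 'model.ckpt')
    saver.save(sess, checkpoint_path, global_step=step)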