Use tensorboard with object detection API in sagemaker - object-detection

Following this, I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job with TensorBoard, but I cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I can do it by saving the logs to an S3 bucket and pointing a local TensorBoard instance at it, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there a command line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is that when I start the training job, nothing is created in my S3 bucket until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (S3) during training, hopefully without modifying the API source code.
Edit
I finally made it work with the accepted solution (TensorFlow natively supports reading/writing to S3); there are, however, additional steps to take:
Disable network isolation in the training job configuration
Provide credentials to the docker image to write to S3 bucket
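For reference, a minimal sketch of how those two steps might look when launching the job with the SageMaker Python SDK (v2 parameter names; the image URI, role, bucket and region are placeholders, and credentials can be supplied the same way or through the execution role):
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-od-image:latest',  # hypothetical image
    role='arn:aws:iam::123456789012:role/MySageMakerRole',                        # hypothetical role
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://your-bucket/output',
    enable_network_isolation=False,           # step 1: let the container reach S3 during training
    environment={'AWS_REGION': 'us-east-1'},  # step 2: extra env vars the container needs to write to S3
)
estimator.fit('s3://your-bucket/training-data')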
The only thing is that TensorFlow continuously polls the filesystem (i.e. looking for an updated model in serving mode), and this causes useless requests to S3 that you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong: TF just writes the logs, it is not polling, so this is expected behavior and the extra costs are minimal.

Looking through the example you posted, it looks as though the model_dir passed to the TensorFlow Object Detection package is configured to /opt/ml/model:
import os

# These are the paths to where SageMaker mounts interesting things in your container.
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
During the training process, tensorboard logs will be written to /opt/ml/model, and then uploaded to s3 as a final model artifact AFTER training: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-envvariables.html.
You might be able to side-step the SageMaker artifact upload step and point the model_dir of TensorFlow Object Detection API directly at an s3 location during training:
model_path = "s3://your-bucket/path/here
This means that the TensorFlow library within the SageMaker job writes directly to S3 instead of the filesystem inside of its container. Assuming the underlying TensorFlow Object Detection code can write directly to S3 (something you'll have to verify), you should be able to see the TensorBoard logs and checkpoints there in real time.
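As a concrete, hedged sketch, assuming the container launches training through the Object Detection API's model_main.py script, pointing --model_dir at S3 would look something like this (paths and bucket are placeholders):
python object_detection/model_main.py \
    --pipeline_config_path=/opt/ml/input/data/training/pipeline.config \
    --model_dir=s3://your-bucket/path/here \
    --alsologtostderr
A local TensorBoard instance can then be pointed at the same location with tensorboard --logdir=s3://your-bucket/path/here, again assuming the TensorFlow build in use has the S3 filesystem support mentioned above.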

Related

saving weights of a tensorflow model in Databricks

In a Databricks notebook which is running on Cluster1 when I do
path='dbfs:/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)
and then immediately try
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
I see the actual weights file in the output display
But when I run the exact same command
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
on a different Databricks notebook which is running on cluster 2, I am getting the error
ls: cannot access 'dbfs:/Shared/P1-Prediction/Weights_folder': No such file or directory
I am not able to interpret this. Does that mean my "save_weights" is saving the weights in the cluster's memory and not in an actual physical location? If so, is there a solution for it?
Any help is highly appreciated.
TensorFlow uses Python's local file API, which doesn't work with dbfs:/... paths; you need to change the path to use /dbfs/... instead of dbfs:/....
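For example, the save call from the question would become (a minimal sketch, reusing the asker's folder):
# /dbfs/... is the POSIX mount of DBFS, shared storage visible from any cluster in the workspace
path = '/dbfs/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)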
But really, it could be better to log the model using MLflow; in that case you can easily load it for inference. See the documentation and maybe this example.
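A hedged sketch of the MLflow route (MLflow 2.x API; the artifact path and run lookup are illustrative):
import mlflow

with mlflow.start_run() as run:
    # log the trained Keras model as an MLflow artifact instead of raw weight files
    mlflow.tensorflow.log_model(model, artifact_path='model')

# later, on any cluster, load it back for inference
loaded = mlflow.pyfunc.load_model(f'runs:/{run.info.run_id}/model')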

Data that should be saved in drive during run is gone

I have a Colab Pro+ subscription, and I ran Python code that trains a network using PyTorch.
Until now, I succeeded in training my network and using my Google drive to save checkpoints and such.
But now I had a run that lasted around 16 hours, and no checkpoint or any other data was saved, even though the logs clearly show that I saved the data and even evaluated metrics on the saved data.
Maybe the data was saved on a different folder somehow?
I looked in the drive activity and I could not see any data that was saved.
Has anyone run into this before?
Any help would be appreciated.
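For reference, the pattern that usually avoids silently losing checkpoints is to mount Drive explicitly, save to the mounted path, and flush before the runtime is recycled. A sketch under those assumptions (the checkpoint path is a placeholder):
from google.colab import drive
import torch

drive.mount('/content/drive')                               # mount Drive explicitly
ckpt_path = '/content/drive/MyDrive/checkpoints/model.pt'   # hypothetical checkpoint path
torch.save(model.state_dict(), ckpt_path)                   # write the checkpoint to Drive
drive.flush_and_unmount()                                   # force pending writes to sync to Drive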

Loading Keras Model in [Google App Engine]

Use-case:
I am trying to load a pre-trained Keras model as a .h5 file in Google App Engine. I am running App Engine on the Python 3.7 runtime in the Standard Environment.
Issue:
I tried using the load_model() Keras function. Unfortunately, the load_model function requires a 'file_path', and I failed to load the model from the Google App Engine file explorer. Further, Google Cloud Storage seems not to be an option, as it is not recognized as a file path.
Questions:
(1) How can I load a pretrained model (e.g. .h5) into Google App Engine (without saving it locally first)?
(2) Maybe there is a way to load the model.h5 into Google App Engine from Google Storage that I have not thought of, e.g by using another function (other than tf.keras.models.load_model()) or in another format?
I just want to read the model in order to make predictions. Writing or training the model is not required.
I finally managed to load the Keras model in Google App Engine, overcoming four challenges:
Solution:
First challenge: As of today, Google App Engine only provides TF version 2.0.0x. Hence, make sure to set the correct version in your requirements.txt file. I ended up using 2.0.0b1 for my project.
Second challenge: In order to use a pretrained model, make sure the model has been saved with the same TensorFlow version that is running on Google App Engine.
Third challenge: Google App Engine does not allow you to read from disk. The only way to read or store data is to use memory or the /tmp folder (as correctly pointed out by user bhito). I ended up connecting my Gcloud bucket and loading the model.h5 file as a blob into the /tmp folder.
Fourth challenge: By default, the instance class of Google App Engine is limited to 256 MB. Due to my model size, I needed to increase the instance class accordingly.
In summary, YES, tf.keras.models.load_model() does work on App Engine reading from Cloud Storage, provided you have the right TF version and the right instance class (with enough memory).
I hope this will help future folks who want to use Google App Engine to deploy their ML models.
You will have to download the file before using it; Cloud Storage paths can't be used to access objects directly. There is a sample of how to download objects in the documentation:
from google.cloud import storage

def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
And then write the file to the /tmp temporary folder which is the only one available in App Engine. But you have to take into consideration that once the instance using the file is deleted, the file will be deleted as well.
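Putting the pieces together for the .h5 use case, a minimal sketch (bucket and object names are placeholders) downloads the model into /tmp with the helper above and then loads it:
import tensorflow as tf

MODEL_BUCKET = 'your-bucket-name'      # placeholder
MODEL_FILENAME = 'models/model.h5'     # placeholder
LOCAL_PATH = '/tmp/model.h5'           # /tmp is the only writable location on App Engine Standard

download_blob(MODEL_BUCKET, MODEL_FILENAME, LOCAL_PATH)
model = tf.keras.models.load_model(LOCAL_PATH)
# model.predict(...) can now serve requests from this instance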
Being more specific to your question, to load a Keras model it's useful to have it as a pickle, as this tutorial shows:
import pickle
from google.cloud import storage

# MODEL_BUCKET and MODEL_FILENAME are module-level configuration constants.
def _load_model():
    global MODEL
    client = storage.Client()
    bucket = client.get_bucket(MODEL_BUCKET)
    blob = bucket.get_blob(MODEL_FILENAME)
    s = blob.download_as_string()
    MODEL = pickle.loads(s)
I was also able to find an answer in another Stack Overflow post that covers what you're actually looking for.

Does google colab permanently change file

I am doing some data pre-processing on Google Colab and am just wondering how it handles dataset manipulation. For example, R does not change the original dataset until you use write.csv to export the changed dataset. Does it work similarly in Colab? Thank you!
Until you explicitly save your changed data, e.g. using df.to_csv to the same file you read from, your changed dataset is not saved.
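A minimal sketch of that pattern with pandas (the file name is a placeholder):
import pandas as pd

df = pd.read_csv('data.csv')            # the file on disk is untouched from here on
df = df.dropna()                        # modifies only the in-memory DataFrame
df.to_csv('data.csv', index=False)      # nothing is persisted until this call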
You must remember that due to inactivity (up to an hour or so), your Colab session might expire and all progress be lost.
Update
To download a model, a dataset or a big file from Google Drive, the gdown command is already available:
!gdown https://drive.google.com/uc?id=FILE_ID
Download your code from GitHub and run predictions using the model you already downloaded
!git clone https://USERNAME:PASSWORD@github.com/username/project.git
Write ! before a line of your code in Colab and it will be treated as a bash command. You can download files from the internet using wget, for example:
!wget file_url
You can commit and push your updated code to GitHub etc., and push the updated dataset / model to Google Drive or Dropbox.

Create large model version in Google ML fails

I've created a tensorflow session where the export.meta file is 553.17 MB. Whenever I try to load the exported graph into Google ML it crashes with the error:
gcloud beta ml models versions create --origin=${TRAIN_PATH}/model/ --model=${MODEL_NAME} v1
ERROR: (gcloud.beta.ml.models.versions.create) Error Response: [3] Create Version failed. Error accessing the model location gs://experimentation-1323-ml/face/model/. Please make sure that service account cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com has read access to the bucket and the objects.
The graph is a static version of a VGG16 face recognition, so export is empty except for a dummy variable, while all the "weights" are constants in export.meta. Could that affect things? How do I go about debugging this?
Update (11/18/2017)
The service currently expects deployed models to have checkpoint files. Some models, such as inception, have folded variables into constants and therefore do not have checkpoint files. We will work on addressing this limitation in the service. In the meantime, as a workaround, you can create a dummy variable, e.g.,
import os
import tensorflow as tf

output_dir = 'my/output/dir'
dummy = tf.Variable([0])
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    saver.save(sess, os.path.join(output_dir, 'export'))
Update (11/17/2017)
A previous version of this post noted that the root cause of the problem was that the training service was producing V2 checkpoints but the prediction service was unable to consume them. This has now been fixed, so it is no longer necessary to force training to write V1 checkpoints; by default, V2 checkpoints are written.
Please retry.
Previous Answer
For future posterity, the following was the original answer, which may still apply to some users in some cases, so leaving here:
The error indicates that this is a permissions problem, and not related to the size of the model. The getting started instructions recommend running:
gcloud beta ml init-project
That generally sets up the permissions properly, as long as the bucket that has the model ('experimentation-1323-ml') is in the same project as you are using to deploy the model (the normal situation).
If things still aren't working, you'll need to follow these instructions for manually setting the correct permissions.
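If it comes to that, granting the service account from the error message read access manually would look roughly like this (a sketch; the linked instructions may use different commands or roles):
gsutil iam ch \
    serviceAccount:cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com:objectViewer \
    gs://experimentation-1323-ml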