I'm trying to add TensorBoard functionality to this SageMaker example: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/hyperparameter_tuning/keras_bring_your_own/hpo_bring_your_own_keras_container.ipynb
The issue is that SageMaker's Estimator.fit() does not seem to support Keras models compiled with callbacks.
A GitHub issue post describes what I need to do to get TensorBoard working:
"You need your code inside the container to save checkpoints to S3,
and you need to periodically sync your local Tensorboard log directory
with your S3 checkpoints."
So to sum it all up, to enable TensorBoard in SageMaker with this custom Keras docker image, it looks like I need a way of periodically uploading a file to an S3 bucket during training without using callbacks. Is this possible to do? I was considering trying to shove this code into a custom loss function, but I'm not sure if this would be the way to go about it. Any help is greatly appreciated!
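One way to do this without Keras callbacks is a background thread that runs alongside training and re-uploads any log file that has changed. The following is a minimal sketch, not a SageMaker API. `PeriodicSync`, `changed_files`, and `make_s3_uploader` are hypothetical names, and it assumes boto3 is installed in the container and that the training job's role can write to the bucket.

```python
import os
import threading


def changed_files(log_dir, seen):
    """Return (path, signature) pairs for files that differ from what `seen` records."""
    changed = []
    for root, _dirs, files in os.walk(log_dir):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            signature = (stat.st_mtime, stat.st_size)
            if seen.get(path) != signature:
                changed.append((path, signature))
    return changed


def make_s3_uploader(bucket, prefix, log_dir):
    """Build an upload function backed by boto3 (assumed available in the container)."""
    import boto3  # imported lazily so the rest of the sketch has no hard dependency

    client = boto3.client("s3")

    def upload(path):
        relative = os.path.relpath(path, log_dir).replace(os.sep, "/")
        client.upload_file(path, bucket, prefix.rstrip("/") + "/" + relative)

    return upload


class PeriodicSync:
    """Upload new or modified files under log_dir every `interval_seconds`."""

    def __init__(self, log_dir, upload, interval_seconds=60.0):
        self.log_dir = log_dir
        self.upload = upload
        self.interval = interval_seconds
        self._seen = {}
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def sync_once(self):
        uploaded = []
        for path, signature in changed_files(self.log_dir, self._seen):
            self.upload(path)
            # Record the file only after a successful upload so failures are retried.
            self._seen[path] = signature
            uploaded.append(path)
        return uploaded

    def _run(self):
        # Event.wait returns False on timeout and True once stop() sets the event.
        while not self._stop.wait(self.interval):
            self.sync_once()

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()
        self.sync_once()  # final flush after training ends
```

In the training script, the idea would be to call `start()` before `model.fit(...)` and `stop()` afterwards, passing the uploader returned by `make_s3_uploader` with your own bucket, prefix, and TensorBoard log directory. Because the thread is independent of the training loop, nothing has to be hidden inside a loss function.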
How can I retrain a model that is currently served in production with TensorFlow Serving when new data arrives?
Do we have to train the model manually and serve it again, or is there an automated way of doing this?
I am using TensorFlow Serving with Docker.
The idea is this:
A model is already being served with TensorFlow Serving. Later I receive a batch of additional data and want to fit the same model to it. How can I run this training on the model that is being served?
Question 1: I do have a script to train the model, but does the training have to be done locally/manually?
Answer: As far as I understand, you are asking whether training has to happen locally or on a remote server. You can train wherever is convenient. The important step for TensorFlow Serving is to save the model in the format that the server can load. Please refer to the link on how to save the model and how to load it in the serving Docker container.
serving tensorflow model
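TensorFlow Serving expects each model's base path to contain numeric version subdirectories and, by default, serves the highest one. A retraining script can therefore export every newly trained model into the next-numbered subdirectory. The helper below is a small sketch that only computes that path. `next_version_dir` is a hypothetical name, and the export itself is left to your training code.

```python
import os


def next_version_dir(base_path):
    """Return base_path/<N+1>, where N is the highest numeric subdirectory (0 if none)."""
    versions = []
    if os.path.isdir(base_path):
        for name in os.listdir(base_path):
            # Only purely numeric directory names count as model versions.
            if name.isdigit() and os.path.isdir(os.path.join(base_path, name)):
                versions.append(int(name))
    return os.path.join(base_path, str(max(versions, default=0) + 1))
```

The training script would save the retrained model to the returned directory, so the served model's base path never has to change.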
Question 2: Suppose I create an entirely new model (apart from modelA, which is currently served). How can I load it into TensorFlow Serving? Do I have to manually load it to the Docker target path?
Answer: Yes. If you are loading it without a serving config, you have to shut down the container manually, remap the path in the command, and then load the model in the Docker container. This is where a serving config helps: it lets you load models at runtime.
Question 3: The TFX documentation says to update the model.config file to add new models, but how can I update it while the server is running?
Answer: A basic configuration file looks like this:
model_config_list {
  config {
    name: 'my_first_model'
    base_path: '/tmp/my_first_model/'
    model_platform: 'tensorflow'
  }
  config {
    name: 'my_second_model'
    base_path: '/tmp/my_second_model/'
    model_platform: 'tensorflow'
  }
}
Map this file into the container before starting it, along with the paths where the different models are located. When the config file changes, the serving container loads the new models accordingly. You can also maintain several versions of the same model. For more information, please refer to this link: serving config. If the server is started with --model_config_file_poll_wait_seconds, it re-reads the file periodically and loads new models as soon as it detects a change, without the need to restart the Docker container.
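Since the config is plain protobuf text, a retraining or deployment script can regenerate it whenever a model is added. The following is a rough sketch under that assumption. `render_model_config` and `write_atomically` are hypothetical helpers, not part of TensorFlow Serving.

```python
import os


def render_model_config(models):
    """Render a model_config_list in protobuf text format from a {name: base_path} mapping."""
    lines = ["model_config_list {"]
    for name, base_path in models.items():
        lines += [
            "  config {",
            f"    name: '{name}'",
            f"    base_path: '{base_path}'",
            "    model_platform: 'tensorflow'",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


def write_atomically(path, text):
    """Write to a temporary file, then rename it, so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(text)
    os.replace(tmp_path, path)
```

Calling `write_atomically` with the output of `render_model_config` for both models reproduces the file above. The rename replaces the file rather than editing it in place, so this assumes the directory containing the config, not the single file, is mounted into the container.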
I've already managed to create a Lambda function that loads a model.pb from S3 and applies object detection to an input image (with TensorFlow 1.12 installed).
Is it possible to load a SageMaker model/endpoint configuration inside a Lambda function? I mean installing all the required packages inside the Lambda, without deploying an endpoint or an EC2-like instance.
I guess inference performance would drop, but the solution seems more cost-effective and easier to scale.
SageMaker Endpoints only run on EC2 instances which are configured as part of the EndpointConfig. You can't use SageMaker models and deploy them onto Lambda.
When I access my Kubeflow endpoint to upload and run a pipeline built with a cloned TFX, the run hangs at the first step with this message:
"This step is in Pending state with this message: ImagePullBackOff: Back-off pulling image "tensorflow/tfx:0.14.0dev"". This is the same image that is referenced in the generated pipeline YAML file.
My overall goal is to build an ExampleGen for tfrecords files, just as described in the guide here. The most recent tfx version in pip is 0.13 and does not yet include the necessary functions. For this reason, I install tf-nightly and clone/build tfx (dev-version 0.14). Doing so and installing some additional modules, e.g. tensorflow_data_validation, I can now create my pipeline using the tfx components and including an ExampleGen for tfrecords files. I finally build the pipeline with the KubeflowRunner. Yet this yields the error stated above.
I now wonder about an appropriate way to address this. I guess one way would be to build an image myself with the specified versions, but maybe there is a more practical way?
TFX doesn't have a nightly image build yet. Currently, it defaults to the image tagged with the version of the library you use to build the pipeline, which is why the tag is 0.14.0dev. This is the current version at HEAD, see here:
https://github.com/tensorflow/tfx/blob/a1f43af5e66f9548ae73eb64813509445843eb53/tfx/version.py#L17
You can build your own image and push it somewhere, for example gcr.io/your-gcp-project/your-image-name:tag, and specify that the pipeline use this image instead, by customizing the tfx_image argument to the pipeline:
https://github.com/tensorflow/tfx/blob/74f9b6ab26c51ebbfb5d17826c5d5288a67dcf85/tfx/orchestration/kubeflow/base_component.py#L54
See for example:
https://github.com/tensorflow/tfx/blob/b3796fc37bd4331a4e964c822502ba5096ad4bb6/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow.py#L243
I'm attempting to train a TensorFlow Estimator on AI Platform. The model trains fine locally, albeit extremely slowly, but as soon as I try to run distributed GPU training on AI Platform I run into this error:
CommandException: No URLs matched: gs://path/.../trainer-0.1.tar.gz
I have my code packaged with the trainer module as recommended by Google Cloud AI Platform. Any help would be appreciated!
I was able to fix my issue. It appears that if I don't set up a staging bucket, the model dir where checkpoints are stored overwrites the trainer package before the worker replicas are able to download it. I'm unsure how checkpoints could even start being written when not all worker replicas had downloaded the trainer yet, but adding a staging bucket that was different from my model dir fixed this.
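A small pre-submit check can catch this kind of collision before the job is launched. This is only a sketch: `gcs_paths_overlap` is a hypothetical helper, and it assumes both locations are given as gs:// paths.

```python
def gcs_paths_overlap(path_a, path_b):
    """Return True if two gs:// paths are equal or one is nested inside the other."""

    def parts(path):
        if not path.startswith("gs://"):
            raise ValueError(f"not a gs:// path: {path}")
        # Split into bucket and object-prefix components, ignoring empty segments.
        return [p for p in path[len("gs://"):].split("/") if p]

    a, b = parts(path_a), parts(path_b)
    shorter = min(len(a), len(b))
    return a[:shorter] == b[:shorter]
```

A submit script could refuse to launch the job when the staging location and the model dir overlap, since that is the situation in which checkpoints and the trainer package can clobber each other.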
I want to use TensorBoard to visualize results stored on an S3 server, without downloading them to my machine. Ideally, this would work:
$ tensorboard --logdir s3://mybucket/summary
This assumes the tfevents files are stored under summary. However, it does not work and returns UnimplementedError: File system scheme s3 not implemented.
Is there some workaround to enable TensorBoard to access the data on the server?
The S3 file system plugin for TensorFlow was released in version 1.4 in early October. You'll need to make sure your tensorflow-tensorboard version is at least 0.4.0-rc1, for example with pip install tensorflow-tensorboard==0.4.0-rc1
Then you can start the server:
tensorboard --logdir=s3://root-bucket/jobs/4/train