I've created a TensorFlow session export where the export.meta file is 553.17 MB. Whenever I try to load the exported graph into Google Cloud ML it fails with the error:
gcloud beta ml models versions create --origin=${TRAIN_PATH}/model/ --model=${MODEL_NAME} v1
ERROR: (gcloud.beta.ml.models.versions.create) Error Response: [3] Create Version failed. Error accessing the model location gs://experimentation-1323-ml/face/model/. Please make sure that service account cloud-ml-service#experimentation-1323-10cd8.iam.gserviceaccount.com has read access to the bucket and the objects.
The graph is a static version of a VGG16 face recognition network, so the export is empty except for a dummy variable, while all the "weights" are constants in export.meta. Could that affect things? How do I go about debugging this?
Update (11/18/2017)
The service currently expects deployed models to have checkpoint files. Some models, such as inception, have folded variables into constants and therefore do not have checkpoint files. We will work on addressing this limitation in the service. In the meantime, as a workaround, you can create a dummy variable, e.g.,
import os
import tensorflow as tf

output_dir = 'my/output/dir'

# Add a dummy variable so the exported model has a checkpoint file.
dummy = tf.Variable([0])
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    saver.save(sess, os.path.join(output_dir, 'export'))
Update (11/17/2017)
A previous version of this post noted that the root cause of the problem was that the training service was producing V2 checkpoints but the prediction service was unable to consume them. This has now been fixed, so it is no longer necessary to force training to write V1 checkpoints; by default, V2 checkpoints are written.
Please retry.
Previous Answer
For future posterity, the following was the original answer, which may still apply to some users in some cases, so leaving here:
The error indicates that this is a permissions problem, and not related to the size of the model. The getting started instructions recommend running:
gcloud beta ml init-project
That generally sets up the permissions properly, as long as the bucket that has the model ('experimentation-1323-ml') is in the same project as the one you are using to deploy the model (the normal situation).
If things still aren't working, you'll need to follow these instructions for manually setting the correct permissions.
In a Databricks notebook running on Cluster1, when I do
path='dbfs:/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)
and then immediately try
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
I see the actual weights file in the output display
But when I run the exact same command
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
in a different Databricks notebook running on Cluster2, I am getting the error
ls: cannot access 'dbfs:/Shared/P1-Prediction/Weights_folder': No such file or directory
I am not able to interpret this. Does that mean my "save_weights" is saving the weights in the cluster's memory and not in an actual physical location? If so, is there a solution for it?
Any help is highly appreciated.
TensorFlow uses Python's local file API, which doesn't work with dbfs:/... paths - you need to change the path to use /dbfs/... instead of dbfs:/....
But really, it could be better to log the model using MLflow; in that case you can easily load it for inference. See the documentation and maybe this example.
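For illustration, a minimal sketch of both options, assuming model is a Keras model and using the folder path from the question (mlflow.keras.log_model may be mlflow.tensorflow.log_model in newer MLflow versions):

import mlflow

# Option 1: save through the local-file mount of DBFS (note /dbfs/..., not dbfs:/...),
# so every cluster in the workspace sees the same file.
model.save_weights('/dbfs/Shared/P1-Prediction/Weights_folder/Weights')

# Option 2: log the model with MLflow; it can then be loaded for inference
# from any cluster via the run's artifact URI or the model registry.
with mlflow.start_run():
    mlflow.keras.log_model(model, artifact_path="model")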
I'm trying to train a new object detection model using the Create ML tool from Apple. I've already used RectLabel to generate annotations for all of the JPEG images in my directory of training images.
However, every time I try loading the directory in Create ML, I receive this error message:
Empty table from specified data source
I already looked on the Apple Developer forums and that thread incorrectly claims the problem was solved in a previous update.
What causes this error? How can I get Create ML to accept my training data?
I'm using Create ML Version 2.0 (53.2.2) and RectLabel Version 3.04.2 (3.04.2) on macOS Big Sur 11.0.1 (20B29).
The “Empty table from specified data source” error occurs if any of the filenames contain spaces.
My solution was to rename all the files so the filenames don't contain spaces.
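For example, a small hypothetical script that strips spaces from every filename in the training folder; if annotations.json references images by filename, remember to update those entries as well:

import os

folder = 'training-images'  # hypothetical path to the directory of training images
for name in os.listdir(folder):
    if ' ' in name:
        os.rename(os.path.join(folder, name),
                  os.path.join(folder, name.replace(' ', '_')))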
Make sure that there are only images and annotations.json file in your directory of training images.
If there are any other files including .mlproj file in the folder, Create ML shows the "Empty table from specified data source" error.
When you create a new project in Create ML, save the project outside the directory of training images.
So unfortunately we have to redeploy our Databricks workspace, in which we use the MLflow functionality with Experiments and the registering of Models.
However, if you export the user folder where the experiment is saved as a DBC archive and import it into the new workspace, the Experiments are not migrated and are just missing.
So the easiest solution did not work. The next thing I tried was to create a new experiment in the new workspace, copy all the experiment data from the DBFS of the old workspace to the new one (with dbfs cp -r dbfs:/databricks/mlflow source, and then the same again to upload it to the new workspace), and then point the experiment at the location of the data, like in the following picture:
This also did not work: no runs are visible, although the path exists.
The next idea was that the registered models are the most important ones, so at least those should be there and accessible. For that I used the documentation here: https://www.mlflow.org/docs/latest/model-registry.html.
With the following code you get a list of the registered models on the old workspace, with references to the run_id and location.
from pprint import pprint
from mlflow.tracking import MlflowClient

client = MlflowClient()
for rm in client.list_registered_models():
    pprint(dict(rm), indent=4)
And with this code you can add models to a model registry with a reference to the location of the artifact data (on the new workspace):
# first the general model must be defined
client.create_registered_model(name='MyModel')
# and then the run of the model you want to register will be added to the model as version one
client.create_model_version(name='MyModel', run_id='9fde022012046af935fe52435840cf1', source='dbfs:/databricks/mlflow/experiment_id/run_id/artifacts/model')
But that also did not work out. If you go into the Model Registry you get an error message like this:
And I really checked: at the given path (the source) the data really is uploaded and a model exists.
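For reference, checking the artifact location from a notebook on the new workspace is just a directory listing, e.g. (same hypothetical path pattern as above):

# List the copied artifact folder to confirm the model files are there.
display(dbutils.fs.ls('dbfs:/databricks/mlflow/experiment_id/run_id/artifacts/model'))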
Do you have any new ideas to migrate those models in Databricks?
There is no official way to migrate experiments from one workspace to another. However, leveraging the MLflow API, there is an "unofficial" tool that can migrate experiments minus the notebook revision associated with a run.
mlflow-tools
As an addition to @Andre's answer,
you can also check mlflow-export-import from the same developer
mlflow-export-import
With this I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job using SageMaker, but I cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I can do it by saving the logs into an S3 bucket and pointing a local TensorBoard instance at it, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there any command line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is, when I start the training job nothing is created on my S3 bucket until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (to S3) during training, hopefully without modifying the API source code.
Edit
I finally made it work with the accepted solution (TensorFlow natively supports read/write to S3); there are, however, additional steps needed:
Disable network isolation in the training job configuration
Provide credentials to the Docker image to write to the S3 bucket
The only thing is that TensorFlow continuously polls the filesystem (i.e. looking for an updated model in serving mode), and this causes useless requests to S3 that you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong, TF just writes logs; it is not polling, so it's expected behavior and the extra costs are minimal.
Looking through the example you posted, it looks as though the model_dir passed to the TensorFlow Object Detection package is configured to /opt/ml/model:
import os

# These are the paths to where SageMaker mounts interesting things in your container.
prefix = '/opt/ml/'

input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
During the training process, TensorBoard logs will be written to /opt/ml/model and then uploaded to S3 as a final model artifact AFTER training: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-envvariables.html.
You might be able to side-step the SageMaker artifact upload step and point the model_dir of the TensorFlow Object Detection API directly at an S3 location during training:
model_path = "s3://your-bucket/path/here"
This means that the TensorFlow library within the SageMaker job writes directly to S3 instead of to the filesystem inside its container. Assuming the underlying TensorFlow Object Detection code can write directly to S3 (something you'll have to verify), you should be able to see the TensorBoard logs and checkpoints there in real time.
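For illustration, a minimal sketch of that idea, assuming the TensorFlow build in the container includes the S3 filesystem and AWS credentials are available (the bucket path is a placeholder):

import tensorflow as tf

# Hypothetical S3 prefix where TensorBoard events and checkpoints should land.
model_dir = 's3://your-bucket/tensorboard-logs/my-training-job'

# Quick sanity check that the S3 filesystem is usable from inside the container.
print(tf.io.gfile.exists('s3://your-bucket/'))

# Hand this model_dir to the run configuration used by the Object Detection API,
# so summaries and checkpoints are written to S3 while the job is still running.
run_config = tf.estimator.RunConfig(model_dir=model_dir, save_checkpoints_steps=1000)

A local TensorBoard instance can then be pointed at the same S3 prefix (tensorboard --logdir s3://your-bucket/tensorboard-logs/my-training-job), assuming local AWS credentials are configured.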
Use-case:
I am trying to load a pre-trained Keras Model as a .h5 file in Google App Engine. I am running App Engine on a Python 3.7 runtime and Standard Environment.
Issue:
I tried using the load_model() Keras function. Unfortunately, the load_model function requires a 'file_path', and I failed to load the Model from the Google App Engine file explorer. Further, Google Cloud Storage seems not to be an option, as it is not recognized as a file path.
Questions:
(1) How can I load a pretrained model (e.g. .h5) into Google App Engine (without saving it locally first)?
(2) Maybe there is a way to load the model.h5 into Google App Engine from Google Cloud Storage that I have not thought of, e.g. by using another function (other than tf.keras.models.load_model()) or in another format?
I just want to read the model in order to make predictions. Writing or training the model is not required.
I finally managed to load the Keras Model in Google App Engine -- overcoming four challenges:
Solution:
First challenge: As of today, Google App Engine only provides TF version 2.0.0x. Hence, make sure to set the correct version in your requirements.txt file. I ended up using 2.0.0b1 for my project.
Second challenge: In order to use a pretrained model, make sure the model has been saved with the same TensorFlow version that is running on Google App Engine.
Third challenge: Google App Engine does not allow you to read from disk. The only way to read or store data is to use memory or the /tmp folder, respectively (as correctly pointed out by user bhito). I ended up connecting to my GCloud bucket and loading the model.h5 file as a blob into the /tmp folder.
Fourth challenge: By default, the instance class of Google App Engine is limited to 256 MB of memory. Due to my model size, I needed to increase the instance class accordingly.
In summary: YES, tf.keras.models.load_model() does work on App Engine, reading from Cloud Storage, given the right TF version and the right instance class (with enough memory).
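For reference, a minimal sketch of that flow (the bucket and object names are placeholders):

import tensorflow as tf
from google.cloud import storage

# Hypothetical bucket/object names: download the .h5 file into /tmp,
# the only writable folder on App Engine Standard, then load it once at startup.
client = storage.Client()
bucket = client.bucket('my-model-bucket')
bucket.blob('model.h5').download_to_filename('/tmp/model.h5')

model = tf.keras.models.load_model('/tmp/model.h5')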
I hope this will help future folks who want to use Google App Engine to deploy their ML models.
You will have to download the file before using it; Cloud Storage paths can't be used like local file paths to access objects. There is a sample showing how to download objects in the documentation:
from google.cloud import storage


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from the bucket."""
    # bucket_name = "your-bucket-name"
    # source_blob_name = "storage-object-name"
    # destination_file_name = "local/path/to/file"

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)
    blob.download_to_filename(destination_file_name)

    print(
        "Blob {} downloaded to {}.".format(
            source_blob_name, destination_file_name
        )
    )
Then write the file to the /tmp temporary folder, which is the only writable one available in App Engine. But take into consideration that once the instance using the file is deleted, the file will be deleted as well.
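For example, a hypothetical call that downloads a model file into /tmp would look like:

# Bucket and object names are placeholders; /tmp is App Engine's only writable folder.
download_blob('my-model-bucket', 'model.h5', '/tmp/model.h5')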
Being more specific to your question: to load a Keras model, it's useful to have it as a pickle, as this tutorial shows:
import pickle

from google.cloud import storage

# MODEL_BUCKET and MODEL_FILENAME are assumed to be defined elsewhere
# (e.g. as module-level constants, as in the tutorial).


def _load_model():
    global MODEL
    client = storage.Client()
    bucket = client.get_bucket(MODEL_BUCKET)
    blob = bucket.get_blob(MODEL_FILENAME)
    s = blob.download_as_string()
    MODEL = pickle.loads(s)
I was also able to find an answer in another Stack Overflow post that covers what you're actually looking for.