I've been trying to run the object_detection_tutorial located at https://github.com/tensorflow/models/tree/master/research/object_detection but keep getting an error when trying to load the model.
I get:
INFO:tensorflow:Saver not created because there are no variables in the graph to restore
when I try to do:
tf.saved_model.load(model_dir)
I've gone through the installation instructions and done all of that but I can't load the model. Anyone have any idea? Thanks.
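For reference, this is roughly what I'm running, following the tutorial (the model name and path below are placeholders for whatever model the tutorial downloaded and extracted):

import tensorflow as tf

# Placeholder path; the real directory is whichever detection model was
# downloaded, with the SavedModel living in its saved_model/ subfolder.
model_dir = "ssd_mobilenet_v1_coco_2017_11_17/saved_model"

# This is the call that prints the INFO line about no variables to restore.
model = tf.saved_model.load(model_dir)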
I am trying to deploy my model. I am encountering the following problem:
FileNotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ram://a603e930-4fda-4105-8554-7af5e5fc02f5/variables/variables
You may be trying to load on a different device from the computational device. Consider setting the experimental_io_device option in tf.saved_model.LoadOptions to the io_device such as '/job:localhost'
This happened after I stored an NLP model in a pickle file. I have since seen that this does not work, so I then tried saving the model as a .h5 file instead. The problem still persists and shows me the above error.
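For reference, this is roughly the save/load pattern I am attempting now, with a toy model standing in for the real NLP model (names and paths are placeholders):

import tensorflow as tf

# Toy stand-in for the real NLP model.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Save with Keras instead of pickling the Python object.
model.save("nlp_model.h5")

# In the deployment environment, load it back the same way.
restored = tf.keras.models.load_model("nlp_model.h5")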
I followed this guide to set up the ADT Explorer. I can upload models into the model view, but I can't put any of them into the graph view. I get this error: Error in instance creation: SyntaxError: unexpected token o in JSON at position 1.
It doesn't seem like there is anything wrong with the models, since they work with an older version of the ADT Explorer.
Could any of the packages cause this problem? Or could it be that my PC didn't complete the console installation properly for some reason?
Edit: I can put models into the graph view on earlier versions of the ADT Explorer, then close it, start up the latest version again, and run a query to get the models into the graph view. So the problem seems to be with creating new twins, either straight from the models themselves or by importing a graph.
Something seems to be wrong with the ADT Explorer; I can't make it work. As a workaround, use the Azure CLI (MS Link):
az dt twin create -n <ADT_instance_name> --dtmi "<dtmi>" --twin-id <twin-id>
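For example, with placeholder values for the instance name, model id and twin id:

az dt twin create -n my-adt-instance --dtmi "dtmi:com:example:Room;1" --twin-id room1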
Following this, I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job with TensorBoard, but I cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I can do it by saving the logs to an S3 bucket and pointing a local TensorBoard instance at it, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there a command-line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is that when I start the training job, nothing is created in my S3 bucket until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (to S3) during training, hopefully without modifying the API source code.
Edit
I finally made it work with the accepted solution (TensorFlow natively supports reading from and writing to S3); there are, however, additional steps to take:
Disable network isolation in the training job configuration
Provide credentials to the Docker image so it can write to the S3 bucket (a rough sketch of both steps follows)
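A rough sketch of those two steps, assuming the SageMaker Python SDK v2 and a custom training image; the image URI, role ARN, region and bucket are placeholders, and in practice an IAM role attached to the job is preferable to literal keys:

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/tf-object-detection:latest",
    role="arn:aws:iam::123456789012:role/SageMakerTrainingRole",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # Step 1: allow the container to reach S3 during training.
    enable_network_isolation=False,
    # Step 2: make credentials visible inside the container.
    environment={
        "AWS_ACCESS_KEY_ID": "<access-key-id>",
        "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
        "AWS_DEFAULT_REGION": "eu-west-1",
    },
)
estimator.fit("s3://your-bucket/training-data/")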
The only issue is that TensorFlow continuously polls the filesystem (i.e. looking for an updated model in serving mode), and this causes useless requests to S3 that you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong: TF just writes logs, it is not polling, so this is expected behavior and the extra costs are minimal.
Looking through the example you posted, it looks as though the model_dir passed to the TensorFlow Object Detection package is configured to /opt/ml/model:
import os

# These are the paths to where SageMaker mounts interesting things in your container.
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
During the training process, TensorBoard logs will be written to /opt/ml/model and then uploaded to S3 as a final model artifact AFTER training: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-envvariables.html.
You might be able to side-step the SageMaker artifact upload step and point the model_dir of TensorFlow Object Detection API directly at an s3 location during training:
model_path = "s3://your-bucket/path/here
This means that the TensorFlow library within the SageMaker job writes directly to S3 instead of to the filesystem inside its container. Assuming the underlying TensorFlow Object Detection code can write directly to S3 (something you'll have to verify), you should be able to see the TensorBoard logs and checkpoints there in real time.
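As a rough sketch of the idea, assuming the Object Detection API's model_main.py entry point; the config path, bucket and prefix are placeholders:

# Inside the container, point the Object Detection API's model_dir at S3:
python model_main.py \
    --pipeline_config_path=/opt/ml/input/data/training/pipeline.config \
    --model_dir=s3://your-bucket/tensorboard-logs/ \
    --alsologtostderr

# Locally, point TensorBoard at the same prefix while the job is running:
tensorboard --logdir=s3://your-bucket/tensorboard-logs/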
I am unable to create a TFRecord, and there is no error in the code; it just keeps executing when I run this command:
python generate_tfrecords.py --csv_input=images\train_labels.csv --image_dir=images\train --output_path=train.record
I have also checked the image files; they are all in one folder and in the same format.
Please help me create the TFRecord.
Thank you.
I created the image labels using labelImg and also did the protobuf compilation.
There is no error, but it does not create the train.record file.
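A small sanity check along these lines (assuming the CSV has a 'filename' column, as in the tutorial's generate_tfrecords.py) can confirm that the CSV rows actually point at files in the image folder:

import os
import pandas as pd

# Paths from the command above.
csv_path = r"images\train_labels.csv"
image_dir = r"images\train"

# List every filename referenced in the CSV that is missing from image_dir.
labels = pd.read_csv(csv_path)
missing = [name for name in labels["filename"].unique()
           if not os.path.exists(os.path.join(image_dir, name))]
print(len(missing), "referenced images are missing from", image_dir)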
I've exported a TensorFlow session where the export.meta file is 553.17 MB. Whenever I try to load the exported graph into Google Cloud ML, it crashes with the error:
gcloud beta ml models versions create --origin=${TRAIN_PATH}/model/ --model=${MODEL_NAME} v1
ERROR: (gcloud.beta.ml.models.versions.create) Error Response: [3] Create Version failed. Error accessing the model location gs://experimentation-1323-ml/face/model/. Please make sure that service account cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com has read access to the bucket and the objects.
The graph is a static version of a VGG16 face-recognition network, so the export is empty except for a dummy variable, while all the "weights" are constants in export.meta. Could that affect things? How do I go about debugging this?
Update (11/18/2017)
The service currently expects deployed models to have checkpoint files. Some models, such as inception, have folded variables into constants and therefore do not have checkpoint files. We will work on addressing this limitation in the service. In the meantime, as a workaround, you can create a dummy variable, e.g.,
import os
import tensorflow as tf

output_dir = 'my/output/dir'
dummy = tf.Variable([0])
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    saver.save(sess, os.path.join(output_dir, 'export'))
Update (11/17/2017)
A previous version of this post noted that the root cause of the problem was that the training service was producing V2 checkpoints but the prediction service was unable to consume them. This has now been fixed, so it is no longer necessary to force training to write V1 checkpoints; by default, V2 checkpoints are written.
Please retry.
Previous Answer
For posterity, the following was the original answer; it may still apply to some users in some cases, so it is left here:
The error indicates that this is a permissions problem, and not related to the size of the model. The getting started instructions recommend running:
gcloud beta ml init-project
That generally sets up the permissions properly, as long as the bucket that has the model ('experimentation-1323-ml') is in the same project as you are using to deploy the model (the normal situation).
If things still aren't working, you'll need to follow these instructions for manually setting the correct permissions.
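If you end up setting them by hand, the manual step is roughly along these lines (the service account and bucket are taken from the error above; the exact role to grant may differ):

gsutil iam ch \
  serviceAccount:cloud-ml-service@experimentation-1323-10cd8.iam.gserviceaccount.com:objectViewer \
  gs://experimentation-1323-ml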