I saved a model trained in Keras as an .h5 file.
When trying to load it in a distributed setting, I got this error:
InvalidArgumentError (see above for traceback): Cannot assign a device
to node 'policy/dense_2_b': Could not satisfy explicit device
specification '/job:ps/task:0' because no devices matching that
specification are registered in this process; available devices:
/job:localhost/replica:0/task:0/cpu:0
It seems that somehow the variables in the .h5 file are assigned to a specific device with the job "localhost", so when I try to load the model on the parameter server, I get this error.
Could anyone clarify how to address this? I should probably load the Keras model first without starting the servers and then reload it on the parameter server, but the details are unclear to me.
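A minimal sketch of one possible approach, assuming TF 1.x graph mode and a hypothetical cluster layout (the hosts, ports, and the policy.h5 filename are placeholders): building the model inside a replica_device_setter scope places its variables on the parameter server as they are created, so the explicit /job:ps placement can be satisfied. Whether load_model respects the device scope may depend on your TF/Keras versions.
import tensorflow as tf
from tensorflow import keras

# Hypothetical cluster; replace hosts/ports with your own.
cluster = tf.train.ClusterSpec({
    'ps': ['localhost:2222'],
    'worker': ['localhost:2223'],
})
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Variables created inside this scope are pinned to /job:ps by the
# device setter as the Keras graph is rebuilt from the .h5 file.
with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    model = keras.models.load_model('policy.h5')  # hypothetical filename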
Is this SavedModel just for TensorFlow front-end applications, or can it be used to reload the model in Keras format? I created it using tf.saved_model.save and now I don't know what to make of it.
Following the guide above, I was able to load a SavedModel directory, but it seems to be of no use: it is not trainable, nor can it be used to predict input like model.predict. And it is the only thing I have, since I lost the .h5 file (*cough* trash bin *cough*).
Note: I noticed this guide tells me to use tf.keras.models.load_model('inceptionv3'), and it returns the error shown in the image.
You have saved the model using tf.saved_model.save, so the correct way to load it back is tf.saved_model.load('inceptionv3'). This is also suggested in your error image.
After loading the model, you can try doing prediction as follows:
import tensorflow as tf

model = tf.saved_model.load('inceptionv3')
out = model(inputs)  # inputs: a tensor matching the model's input signature
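If calling the loaded object directly does not work, the exported signatures are another entry point; a small sketch, assuming the default serving signature was exported:
infer = model.signatures['serving_default']
out = infer(tf.constant(inputs))
print(list(out.keys()))  # the signature's named output tensors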
I recently used the Google AutoML service to create a model.
Its output seems to be in SavedModel format. However, when I attempt to load it via tf.saved_model.load, it displays the following error:
Op type not registered 'TreeEnsembleSerialize' in binary ...
When I looked up this op, I found that it exists in tf.contrib.boosted_trees in TensorFlow 1.15, but since TensorFlow 2 removed tf.contrib, this op has been renamed to BoostedTreesSerializeEnsemble in tf.raw_ops.
My question is: is there any way to duplicate the op and rename it to TreeEnsembleSerialize, so the saved model can be loaded without errors?
Thanks.
There are no significant compatibility concerns for saved models.
TensorFlow 1.x saved_models work in TensorFlow 2.x.
TensorFlow 2.x saved_models work in TensorFlow 1.x if all the ops are supported.
For more information, visit the TensorFlow docs.
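Since TreeEnsembleSerialize is a contrib op, one possible workaround (a sketch, assuming the model was exported by TF 1.x; the 'serve' tag and the export path are placeholders) is to load it under TF 1.15, where importing the contrib package registers the op before the SavedModel is parsed:
import tensorflow as tf                   # TF 1.15
import tensorflow.contrib.boosted_trees   # registers the TreeEnsemble* contrib ops

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], 'automl_model')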
At first I wanted to use the Microsoft camera trap model on GCP AI Platform using the .pb SavedModel.
But it wouldn't validate the version (SavedModel must exactly contain one metagraph), so I tried with TensorFlow Serving (Docker) and checked with github/tensorflow/tensorflow/python/tools/saved_model_cli.py:
RuntimeError: MetaGraphDef associated with tag-set could not be found in SavedModel
I thought the model had simply been exported incorrectly, so I tried other SavedModels, like https://tfhub.dev/google/openimages_v4/ssd/mobilenet_v2/1 (downloading the tar directly to get the .pb).
But I got exactly the same errors... (I also tried with the ResNet one.)
Am I doing something wrong?
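To see which tag-sets the file actually contains, you can parse the SavedModel proto directly; a minimal sketch (the model_dir path is a placeholder):
from tensorflow.core.protobuf import saved_model_pb2

# List the tags of every MetaGraphDef in the SavedModel. If the tag-set
# you pass to saved_model_cli or TF Serving is not in this list, you get
# the "MetaGraphDef ... could not be found" error.
sm = saved_model_pb2.SavedModel()
with open('model_dir/saved_model.pb', 'rb') as f:
    sm.ParseFromString(f.read())
for mg in sm.meta_graphs:
    print(list(mg.meta_info_def.tags))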
I have two Nvidia Titan X cards on my machine and want to fine-tune a COCO-pretrained Inception V2 model on a single specific class. I have created the train/val tfrecords and changed the config to run the TensorFlow object detection training pipeline.
I am able to start the training, but it hangs (without any OOM) whenever it tries to evaluate a checkpoint. Currently it is using only GPU 0, with other resources (RAM, CPU, IO, etc.) in the normal range, so I am guessing that the GPU is the bottleneck. I wanted to try splitting training and validation across separate GPUs to see if that works.
I looked for a place where I could set "CUDA_VISIBLE_DEVICES" differently for the two processes, but unfortunately the latest TensorFlow Object Detection API code (using TensorFlow 1.12) makes it very difficult to do so. I am also unable to verify my assumption that training and validation run in the same process, as my machine hangs. Could someone please suggest where to look to solve this?
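One way to test the idea, sketched under the assumption that the legacy train.py / eval.py scripts of the Object Detection API are available (the paths and flags below are placeholders for your setup): launch training and evaluation as separate processes, each pinned to its own GPU via CUDA_VISIBLE_DEVICES.
import os
import subprocess

cfg = '--pipeline_config_path=pipeline.config'  # hypothetical config path

# Training pinned to GPU 0.
subprocess.Popen(
    ['python', 'object_detection/legacy/train.py', cfg, '--train_dir=./train'],
    env=dict(os.environ, CUDA_VISIBLE_DEVICES='0'))

# Evaluation pinned to GPU 1, reading checkpoints written by the trainer.
subprocess.Popen(
    ['python', 'object_detection/legacy/eval.py', cfg,
     '--checkpoint_dir=./train', '--eval_dir=./eval'],
    env=dict(os.environ, CUDA_VISIBLE_DEVICES='1'))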
I want to implement model parallelism automatically in TensorFlow.
I slightly modified the TensorFlow placement code (simple_placer.cc) in version 1.3. The placement works in the case of MNIST, but it produces an error on Inception:
InvalidArgumentError (see above for traceback): Trying to access resource located in device /job:worker/replica:0/task:1/cpu:0 from device /job:worker/replica:0/task:0/cpu:0
I would like some advice about this error, such as when it comes up or what conditions cause it.
Thanks.
This error typically happens when some operation attempts to read one of its inputs, but that input resides on another device. Normally, when TensorFlow places operations on different devices, it inserts send/recv nodes into the execution graph to exchange tensors between those devices. Your changes might have broken some of that logic.
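For intuition, a minimal sketch (TF 1.x graph mode; the job/task names are hypothetical) of the kind of placement that triggers this message: resource-variable handles must be dereferenced on the device that owns the variable, and unlike plain tensors the dereference is not bridged by send/recv.
import tensorflow as tf

# Resource variable owned by worker task 1.
with tf.device('/job:worker/task:1/cpu:0'):
    v = tf.get_variable('v', shape=[], use_resource=True)

# Forcing the read onto worker task 0 dereferences the resource handle on
# the wrong device; running this in a distributed session raises
# "Trying to access resource located in device ... from device ...".
with tf.device('/job:worker/task:0/cpu:0'):
    out = v + 1.0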