MultiWorkerMirroredStrategy() not working on Google AI-Platform (CMLE) - tensorflow

I'm getting the following error while using MultiWorkerMirroredStrategy() to train a custom Estimator on Google AI Platform (CMLE):
ValueError: Unrecognized task_type: 'master', valid task types are: "chief", "worker", "evaluator" and "ps".
Both MirroredStrategy() and ParameterServerStrategy() work fine on AI Platform with their respective config.yaml files. I'm currently not providing device scopes for any operations, nor am I providing a device filter in the session config, tf.ConfigProto(device_filters=device_filters).
The config.yaml file which I'm using for training with MultiWorkerMirroredStrategy() is:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
The masterType input is mandatory for submitting the training job on AI-Platform.
Note: It's showing 'chief' as a valid task type and 'master' as invalid. I'm providing tensorflow-gpu==1.14.0 in setup.py for the trainer package.

I ran into the same issue. As far as I understand, the MultiWorkerMirroredStrategy config values differ from those of the other strategies and from what CMLE provides by default: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration
It doesn't support a 'master' node; it calls it 'chief' instead.
If you are running your jobs in a container, you can try the 'useChiefInTfConfig' flag; see the documentation here: https://developers.google.com/resources/api-libraries/documentation/ml/v1/python/latest/ml_v1.projects.jobs.html
Otherwise you might try patching your TF_CONFIG manually:
import os

TF_CONFIG = os.environ.get('TF_CONFIG')
if TF_CONFIG and '"master"' in TF_CONFIG:
    os.environ['TF_CONFIG'] = TF_CONFIG.replace('"master"', '"chief"')
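For reference, a minimal sketch of the chief-style TF_CONFIG that MultiWorkerMirroredStrategy expects (the hostnames and ports here are made up), with the same rewrite applied to it:

```python
import json
import os

# Hypothetical cluster: one chief and two workers (addresses are examples only).
tf_config = {
    "cluster": {
        "chief": ["10.0.0.1:2222"],
        "worker": ["10.0.0.2:2222", "10.0.0.3:2222"],
    },
    "task": {"type": "chief", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# The rewrite above is a no-op here, since this config already says
# "chief" rather than "master".
cfg = os.environ["TF_CONFIG"]
if '"master"' in cfg:
    os.environ["TF_CONFIG"] = cfg.replace('"master"', '"chief"')
```

On AI Platform the TF_CONFIG is generated for you, which is exactly why the 'master'/'chief' mismatch surfaces.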

(1) This appears to be a bug with MultiWorkerMirroredStrategy, then. Please file a bug in TensorFlow. In TensorFlow 1.x it should be using 'master', and in TensorFlow 2.x it should be using 'chief'. The code is (wrongly) asking for 'chief', and AI Platform (because you are using 1.14) is providing only 'master'. Incidentally: master = chief + evaluator.
(2) Do not add tensorflow to your setup.py. Specify the TensorFlow framework you want AI Platform to use via the --runtime-version flag to gcloud (see https://cloud.google.com/ml-engine/docs/runtime-version-list).

Related

Grappler optimization failed. Error: Op type not registered 'FusedBatchNormV3' in Tensorflow Serving

I am serving a model using Tensorflow Serving.
TensorFlow ModelServer: 1.13.0-rc1+dev.sha.fd92d2f
TensorFlow Library: 1.13.0-rc1
I sanity tested with load_model and predict(...) in notebook and it is making the expected predictions. The model is a ResNet50 with custom head (fine tuned).
If I try to submit the request as instructed in:
https://www.tensorflow.org/tfx/tutorials/serving/rest_simple
I get this error:
2022-02-10 22:22:09.120103: W external/org_tensorflow/tensorflow/core/kernels/partitioned_function_ops.cc:197] Grappler optimization failed. Error: Op type not registered 'FusedBatchNormV3' in binary running on tensorflow-no-gpu-20191205-rec-eng. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
2022-02-10 22:22:09.137225: W external/org_tensorflow/tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at partitioned_function_ops.cc:118 : Not found: Op type not registered 'FusedBatchNormV3' in binary running on tensorflow-no-gpu-20191205-rec-eng. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
Any idea how to resolve? Will provide more details upon request.
I found out what's wrong: the TensorFlow ModelServer version was wrong (too old for the model). Just ensure it is:
TensorFlow ModelServer: 2.8.0-rc1+dev.sha.9400ef1
TensorFlow Library: 2.8.0
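The underlying cause is a server too old to have the op registered; to my understanding FusedBatchNormV3 only appeared in TF 1.14, so a 1.13 ModelServer cannot load a graph that uses it (that cutoff is an assumption from TF release history, not something stated in the thread). A toy sketch of the version check:

```python
# Hypothetical helper: decide whether a TF Serving binary of a given
# version can be expected to know an op, based on when the op appeared.
# The (1, 14) cutoff for FusedBatchNormV3 is an assumption, see above.
OP_INTRODUCED_IN = {
    "FusedBatchNormV3": (1, 14),
}

def server_knows_op(op_name, server_version):
    """Return True if `server_version` (e.g. '2.8.0') should have `op_name` registered."""
    introduced = OP_INTRODUCED_IN.get(op_name)
    if introduced is None:
        return True  # op not in our table: assume it's fine
    major, minor = (int(part) for part in server_version.split(".")[:2])
    return (major, minor) >= introduced
```

So `server_knows_op("FusedBatchNormV3", "1.13.0")` is False, while any 1.14+ or 2.x server passes.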

Create Version Failed. Bad model detected with error: "Error loading the model" - AI Platform Prediction

I created a model through the AI Platform UI that uses a global endpoint. I am trying to deploy a basic TensorFlow 1.15.0 model I exported using the SavedModel builder. When I try to deploy this model I get a Create Version Failed. Bad model detected with error: "Error loading the model" error in the UI, and I see the following in the logs:
ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x
Failure: Could not reach metadata service: Internal Server Error.
ERROR:root:Command '['/tools/google-cloud-sdk/bin/gsutil', '-o', 'GoogleCompute:service_account=default', 'cp', '-R', 'gs://cml-365057443918-1608667078774578/models/xsqr_global/v6/7349456410861999293/model/*', '/tmp/model/0001']' returned non-zero exit status 1.
ERROR:root:Error loading model: 'generator' object has no attribute 'next'
ERROR:root:Error loading the model
Framework/ML runtime version: Tensorflow 1.15.0
Python: 3.7.3
What is strange is that the gcloud ai-platform local predict works correctly with this exported model, and I can deploy this exact same model on a regional endpoint with no issues. It only gives this error if I try to use a global endpoint model. But I need the global endpoint because I plan on using a custom prediction routine (if I can get this basic model working first).
The logs seem to suggest an issue with copying the model from storage? I've tried giving various IAM roles additional viewer permissions, but I still get the same errors.
Thanks for the help.
I think it's the same issue as https://issuetracker.google.com/issues/175316320
The comment in the issue says the fix is now rolling out.
Today I faced the same error (ERROR: (gcloud.ai-platform.versions.create) Create Version failed. Bad model detected with error: "Error loading the model"); for those who want a summary:
The recommendation is to use n1-* machine types (for example, n1-standard-4) via regional endpoints (for example, us-central1) instead of mls1-* machines when deploying a version. I also made sure to specify the same region (us-central1) when creating the model itself using the command below, thereby resolving the above-mentioned error.
!gcloud ai-platform models create $model_name \
    --region=$REGION
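That recommendation can be encoded as a trivial rule of thumb (the prefixes come from the answer above; this is not an official compatibility matrix):

```python
def deploy_advice(machine_type):
    """Rule of thumb from the answer above: prefer n1-* machine types on
    regional endpoints over legacy mls1-* machines."""
    if machine_type.startswith("n1-"):
        return "ok: deploy via a regional endpoint, e.g. us-central1"
    if machine_type.startswith("mls1-"):
        return "avoid: legacy machine type; switch to an n1-* type"
    return "unknown machine type"
```

For example, `deploy_advice("n1-standard-4")` passes, while `deploy_advice("mls1-c1-m2")` flags the legacy type.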

Bad model deploying to GCP Cloudml

I’m trying to deploy a model trained using Tensorflow 1.7 onto Google Cloud Platform. I get the following error:
Create Version failed. Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'SparseFillEmptyRows'\n\n (Error code: 0)"
I know Cloud ML runtime prediction only supports TensorFlow 1.6, so I tried specifying:
REQUIRED_PACKAGES = [
    'tensorflow==1.6',
]
in setup.py, but I still get the same message.
Any help gratefully appreciated.
You need to rebuild your model using TensorFlow 1.6. You can't deploy a model created with TensorFlow 1.7 to the ML Engine.
Also, you can set the version of the engine's runtime to one of the versions listed here. If you're using gcloud ml-engine jobs submit training, you can set the version with the --runtime-version flag. The documentation is here.
Rebuilding with 1.6 and deploying with --runtime-version=1.6 worked.

Python Configuration Error when building retrain.py with bazel, following the Google doc

I am learning transfer learning following How to Retrain Inception's Final Layer for New Categories; however, when I build retrain.py using Bazel, the following error occurs.
The error message is:
python configuration error: 'PYTHON_BIN_PATH' environment variable is not set and referenced by '//third_party/py/numpy:headers'
I'm sorry, I have done my best to display the error image; unfortunately, I failed.
I am using Python 2.7, Anaconda 2, Bazel 0.6.1, and TensorFlow 1.3.
I appreciate any reply!
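No answer was posted, but the error itself says the PYTHON_BIN_PATH environment variable is unset; TensorFlow's ./configure script normally sets it before a Bazel build. A sketch of setting it by hand (running ./configure first is the usual route; pointing at the current interpreter is only an illustrative assumption):

```python
import os
import shutil
import sys

# Point Bazel at the Python interpreter to use for the build.
# ./configure would normally do this; here we fall back to whatever
# 'python' resolves to, or to the interpreter running this script.
os.environ.setdefault("PYTHON_BIN_PATH", shutil.which("python") or sys.executable)
print(os.environ["PYTHON_BIN_PATH"])
```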

Getting error on ML-Engine predict but local predict works fine

I have searched a lot here but unfortunately could not find an answer.
I am running TensorFlow 1.3 (installed via pip on macOS) on my local machine, and have created a model using the provided "ssd_mobilenet_v1_coco" checkpoints.
I managed to train locally and on the ML-Engine (Runtime 1.2), and successfully deployed my savedModel to the ML-Engine.
Local predictions (below code) work fine and I get the model results
gcloud ml-engine local predict --model-dir=... --json-instances=request.json
FILE request.json: {"inputs": [[[242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 239], [242, 240, 23]]]}
However, when deploying the model and trying to run remote predictions on the ML Engine with the code below:
gcloud ml-engine predict --model "testModel" --json-instances request.json (same JSON file as before)
I get this error:
{
"error": "Prediction failed: Exception during model execution: AbortionError(code=StatusCode.INVALID_ARGUMENT, details=\"NodeDef mentions attr 'data_format' not in Op<name=DepthwiseConv2dNative; signature=input:T, filter:T -> output:T; attr=T:type,allowed=[DT_FLOAT, DT_DOUBLE]; attr=strides:list(int); attr=padding:string,allowed=[\"SAME\", \"VALID\"]>; NodeDef: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)\n\t [[Node: FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_1_depthwise/depthwise = DepthwiseConv2dNative[T=DT_FLOAT, _output_shapes=[[-1,150,150,32]], data_format=\"NHWC\", padding=\"SAME\", strides=[1, 1, 1, 1], _device=\"/job:localhost/replica:0/task:0/cpu:0\"](FeatureExtractor/MobilenetV1/MobilenetV1/Conv2d_0/Relu6, FeatureExtractor/MobilenetV1/Conv2d_1_depthwise/depthwise_weights/read)]]\")"
}
I saw something similar here: https://github.com/tensorflow/models/issues/1581
About the problem being with the "data-format" parameter.
But unfortunately I could not use that solution since I am already on TensorFlow 1.3.
It also seems that it might be a problem with MobilenetV1: https://github.com/tensorflow/models/issues/2153
Any ideas?
I had a similar issue. It is due to a mismatch in the TensorFlow versions used for training and inference. I solved it by using TensorFlow 1.4 for both training and inference.
Please refer to this answer.
If you're wondering how to ensure that your model version is running the correct tensorflow version that you need to run, first have a look at this model versions list page
You need to know which model version supports the Tensorflow version that you need. At the time of writing:
ML version 1.4 supports TensorFlow 1.4.0 and 1.4.1
ML version 1.2 supports TensorFlow 1.2.0
ML version 1.0 supports TensorFlow 1.0.1
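The list above can be encoded as a simple lookup (the table reflects only the versions listed at the time of that answer):

```python
# ML Engine runtime version -> supported TensorFlow versions, per the list above.
RUNTIME_SUPPORTS = {
    "1.4": ["1.4.0", "1.4.1"],
    "1.2": ["1.2.0"],
    "1.0": ["1.0.1"],
}

def runtime_for(tf_version):
    """Return the --runtime-version that supports `tf_version`, or None if unlisted."""
    for runtime, tf_versions in RUNTIME_SUPPORTS.items():
        if tf_version in tf_versions:
            return runtime
    return None
```

So for a model trained with TensorFlow 1.4.1, `runtime_for("1.4.1")` gives "1.4", matching the command below.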
Now that you know which model version you require, you need to create a new version from your model, like so:
gcloud ml-engine versions create <version name> \
--model=<Name of the model> \
--origin=<Model bucket link. It starts with gs://...> \
--runtime-version=1.4
In my case, I needed to predict using Tensorflow 1.4.1, so I used the runtime version 1.4.
Refer to this official MNIST tutorial page, as well as this ML Versioning Page