Create Version Failed. Bad model detected with error: "Error loading the model" - AI Platform Prediction - tensorflow

I created a model through the AI Platform UI that uses a global endpoint. I am trying to deploy a basic TensorFlow 1.15.0 model I exported using the SavedModel builder. When I try to deploy this model I get a Create Version Failed. Bad model detected with error: "Error loading the model" error in the UI, and I see the following in the logs:
ERROR:root:Failed to import GA GRPC module. This is OK if the runtime version is 1.x
Failure: Could not reach metadata service: Internal Server Error.
ERROR:root:Command '['/tools/google-cloud-sdk/bin/gsutil', '-o', 'GoogleCompute:service_account=default', 'cp', '-R', 'gs://cml-365057443918-1608667078774578/models/xsqr_global/v6/7349456410861999293/model/*', '/tmp/model/0001']' returned non-zero exit status 1.
ERROR:root:Error loading model: 'generator' object has no attribute 'next'
ERROR:root:Error loading the model
Framework/ML runtime version: TensorFlow 1.15.0
Python: 3.7.3
What is strange is that gcloud ai-platform local predict works correctly with this exported model, and I can deploy this exact same model to a regional endpoint with no issues. The error only appears when I deploy to a global endpoint. But I need the global endpoint because I plan on using a custom prediction routine (if I can get this basic model working first).
The logs seem to suggest an issue with copying the model from storage. I've tried granting additional viewer permissions to various IAM roles, but I still get the same errors.
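For reference, the local predict call that succeeds looks roughly like this (the export directory and input file are placeholders for my actual paths):
gcloud ai-platform local predict \
    --model-dir=./export/1608667078 \
    --json-instances=instances.json \
    --framework=tensorflow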
Thanks for the help.

I think it's the same issue as https://issuetracker.google.com/issues/175316320
The comment in the issue says the fix is now rolling out.

Today I faced the same error (ERROR: (gcloud.ai-platform.versions.create) Create Version failed. Bad model detected with error: "Error loading the model"), and for those who want a summary:
The recommendation is to use n1* machine types (for example, n1-standard-4) via regional endpoints (for example, us-central1) instead of mls1* machines when deploying a version. I also made sure to specify the same region (us-central1) when creating the model itself, using the command below, which resolved the above error.
!gcloud ai-platform models create $model_name \
    --region=$REGION
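A version can then be deployed to the same regional endpoint with an n1 machine type. A sketch, assuming your SavedModel lives under a gs:// path of your own; the version name, bucket, and path are placeholders:
!gcloud ai-platform versions create v1 \
    --model=$model_name \
    --region=us-central1 \
    --origin=gs://your-bucket/path/to/saved_model_dir \
    --runtime-version=1.15 \
    --python-version=3.7 \
    --machine-type=n1-standard-4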

Related

Unable to adapt code to convert TensorFlow model for Google Earth Engine (EEification)

I tried to adapt the example from the Colab notebook on GitHub to train a model in TensorFlow and then convert it into a GEE-friendly format (EEification) for export and use in the GEE code editor.
However, for some reason, the EEification code would not run successfully. I got this error:
ERROR: (gcloud.ai-platform.versions.create)
Error: model server never became ready. Please validate that your model file or container configuration are valid
Upon debugging this error, I get:
DEBUG: Making request: POST https://oauth2.googleapis.com/token
DEBUG: Starting new HTTPS connection (1): oauth2.googleapis.com:443
DEBUG: https://oauth2.googleapis.com:443 "POST /token HTTP/1.1" 200 None
DEBUG: https://us-central1-ml.googleapis.com:443 "GET /v1/projects/bucket_name/operations/model_name?alt=json HTTP/1.1" 200 None
Why is there no response from the server when I (mostly) changed only the variables in the example to my own objects (bucket_name and model_name are placeholders for my actual bucket and model names)? The only major change I can think of is using the Project ID instead of the Project Name, because the code refused to run when I used the Project Name.
What is this error about, and how can I troubleshoot it?
I was having the same issue and I solved it by ensuring that the TensorFlow runtime in my Colab notebook is the same as the one used for the model on AI Platform (2.1 in the case of the notebook that you shared). Try the following where you import TensorFlow:
!pip install tensorflow==2.1.0   # match the AI Platform runtime version
import tensorflow as tf
print(tf.__version__)            # should print 2.1.0

Error Connecting to Substrate: Unable to initialize the API: createType(StorageKey):: Derived

I've got a Substrate node running locally on my PC, following this tutorial: https://substrate.dev/docs/en/tutorials/create-your-first-substrate-chain/interact. It can be viewed at two addresses:
Local: http://localhost:8000/substrate-front-end-template
On Your Network: http://192.168.56.1:8000/substrate-front-end-template
So I don't think connectivity is the issue.
Anyway, I bound @polkadot/api to my node via the command:
yarn add @polkadot/api
I'm now getting an error, in the browser, whenever I run my node:
Error Connecting to Substrate
Error: FATAL: Unable to initialize the API: createType(StorageKey):: Derived TypedArray constructor created an array which was too small
Can anyone help?
Upgrading to the latest Substrate Sidecar API resolved these issues for me.
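A sketch of the upgrade step, assuming yarn and the standard package names (@substrate/api-sidecar for Sidecar, @polkadot/api in the frontend template):
yarn upgrade @substrate/api-sidecar --latest
yarn upgrade @polkadot/api --latest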

MultiWorkerMirroredStrategy() not working on Google AI-Platform (CMLE)

I'm getting the following error while using MultiWorkerMirroredStrategy() for training Custom Estimator on Google AI-Platform (CMLE).
ValueError: Unrecognized task_type: 'master', valid task types are: "chief", "worker", "evaluator" and "ps".
Both MirroredStrategy() and ParameterServerStrategy() are working fine on AI Platform with their respective config.yaml files. I'm currently not providing device scopes for any operations, nor am I providing any device filter in the session config, tf.ConfigProto(device_filters=device_filters).
The config.yaml file which I'm using for training with MultiWorkerMirroredStrategy() is:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
The masterType input is mandatory for submitting the training job on AI-Platform.
Note: It's showing 'chief' as a valid task type and 'master' as invalid. I'm providing tensorflow-gpu==1.14.0 in setup.py for the trainer package.
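For context, a minimal sketch of how the strategy is wired into a custom Estimator in TF 1.14; my_model_fn stands in for your own model function:
import tensorflow as tf

# MultiWorkerMirroredStrategy reads the cluster layout from the TF_CONFIG
# environment variable, which AI Platform populates for each replica.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
run_config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=run_config)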
I ran into the same issue. As far as I understand, MultiWorkerMirroredStrategy's config values are different from those of other strategies and from what CMLE provides by default: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration
It doesn't support a 'master' node; it calls it 'chief' instead.
If you are running your jobs in a container, you can try using the 'useChiefInTfConfig' flag; see the documentation here: https://developers.google.com/resources/api-libraries/documentation/ml/v1/python/latest/ml_v1.projects.jobs.html
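A sketch of what that might look like in config.yaml, assuming the flag passes through like any other trainingInput field:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
  useChiefInTfConfig: true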
Otherwise you might try hacking your TF_CONFIG manually:
import os

# AI Platform (runtime 1.x) labels the coordinator 'master', but
# MultiWorkerMirroredStrategy expects 'chief', so patch TF_CONFIG in place.
TF_CONFIG = os.environ.get('TF_CONFIG')
if TF_CONFIG and '"master"' in TF_CONFIG:
    os.environ['TF_CONFIG'] = TF_CONFIG.replace('"master"', '"chief"')
(1) This appears to be a bug with MultiWorkerMirroredStrategy, then. Please file a bug in TensorFlow. In TensorFlow 1.x it should be using master, and in TensorFlow 2.x it should be using chief. The code is (wrongly) asking for chief, and AI Platform (because you are using 1.14) is providing only master. Incidentally: master = chief + evaluator.
(2) Do not add tensorflow to your setup.py. Specify the TensorFlow framework you want AI Platform to use with the --runtime-version flag to gcloud (see https://cloud.google.com/ml-engine/docs/runtime-version-list).
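A sketch of a job submission with the runtime pinned that way; the job name, package path, and bucket are placeholders:
gcloud ai-platform jobs submit training my_job \
    --region=us-central1 \
    --runtime-version=1.14 \
    --python-version=3.5 \
    --module-name=trainer.task \
    --package-path=trainer/ \
    --config=config.yaml \
    --staging-bucket=gs://your-bucket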

Bad model deploying to GCP Cloudml

I’m trying to deploy a model trained using TensorFlow 1.7 onto Google Cloud Platform. I get the following error:
Create Version failed. Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Not found: Op type not registered 'SparseFillEmptyRows'\n\n (Error code: 0)"
I know Cloud ML runtime prediction only supports TensorFlow 1.6, so I tried specifying:
REQUIRED_PACKAGES = [
    'tensorflow==1.6',
]
in setup.py, but I still get the same message.
Any help is gratefully appreciated.
You need to rebuild your model using TensorFlow 1.6. You can't deploy a model created with TensorFlow 1.7 to the ML Engine.
Also, you can set the version of the engine's runtime to one of the versions listed here. If you're using gcloud ml-engine jobs submit training, you can set the version with the --runtime-version flag. The documentation is here.
Rebuilding with 1.6 and deploying with --runtime-version=1.6 worked.
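A sketch of that deployment step, with the model name and export path as placeholders:
gcloud ml-engine versions create v1 \
    --model=my_model \
    --origin=gs://your-bucket/export/1 \
    --runtime-version=1.6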

Python configuration error when building retrain.py with Bazel, following Google doc

I am learning transfer learning according to How to Retrain Inception's Final Layer for New Categories. However, when I build retrain.py using Bazel, the following error occurs:
The error message is:
python configuration error:'PYTHON_BIN_PATH' environment variable is not set and referenced by '//third_party/py/numpy:headers'
I am so sorry; I have done my best to display the error image, but unfortunately I failed.
I use Python 2.7, Anaconda 2, Bazel 0.6.1, and TensorFlow 1.3.
I appreciate any reply!
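A common fix for this Bazel error is to point Bazel at the Python interpreter before building; a sketch, assuming a TensorFlow 1.x source checkout (the build target comes from the retraining tutorial):
export PYTHON_BIN_PATH=$(which python)
bazel build tensorflow/examples/image_retraining:retrain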