Gcloud ai-platform local predict Error: gcloud crashed (PermissionError): [WinError 5] Access is denied

I was trying to run a command to test local predict on my computer. However, the command failed every time with this error:
ERROR: gcloud crashed (PermissionError): [WinError 5] Access is denied
This is the command:
gcloud ai-platform local predict --model-dir model_final --json-instances image_b64.json --framework tensorflow
I am 101% positive that I have followed everything in Google's documentation.
First, the command requires the model to be saved in TensorFlow SavedModel format, which, since I use Keras, I can do with just model.save("model_final").
If you have used Keras for training, use tf.keras.Model.save to export a SavedModel
So I did, and it output only a single file, so I can only assume that's what goes in the --model-dir parameter. I admit model.save("model_final") created a file, not a directory, which is a bit weird, but the Keras documentation just said to use that, so there is no way I could be wrong.
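As an aside, in TF 1.x tf.keras, model.save with no .h5 extension still defaults to writing a single HDF5 file, while --model-dir expects a SavedModel directory (saved_model.pb plus a variables/ folder). A minimal sketch of exporting that layout, assuming TF 1.15 and a stand-in model:

import tensorflow as tf  # assuming TF 1.15

# stand-in model; replace with your trained Keras model
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")

# model.save("model_final") would write one HDF5 file here; this instead
# writes the SavedModel directory layout that gcloud expects
tf.keras.experimental.export_saved_model(model, "model_final_savedmodel")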
And also:
If you export your SavedModel using tf.keras.Model.save, then you do not need to specify a serving input function.
If you export a SavedModel from tf.keras or from a TensorFlow estimator, the exported graph is ready for serving by default.
The "image_b64.json" file follows this format:
{"image_bytes":{"b64": base64_jpeg_data )}}
So after 3 hours of following everything required by Google, gcloud still throws that error. And yes, of course I ran the command line in Administrator Mode. I also tried it on two of my computers and got the same error. I am using Windows and TensorFlow 1.15.
Can anyone point out what the problem with my implementation is, or whether the Google documentation/Keras is just lackluster? Thank you.

Related

How can I do transfer learning using a TPU on Colab?

I am trying to teach myself transfer learning techniques using TensorFlow 2 on Colab.
Using a GPU works fine, but as everybody knows, Google has its TPUs, and they are faster than GPUs.
In Colab, when I switch the runtime type from GPU to TPU, I add --use_tpu=true to the command below:
python /content/models/research/object_detection/model_main_tf2.py \
--pipeline_config_path={pipeline_fname} \
--model_dir={model_dir} \
--checkpoint_dir={model_dir} \
--eval_timeout=60 \
--use_tpu=true
This script is found in the Models repo.
git clone --quiet https://github.com/tensorflow/models.git
However, it is not working and a few minutes later, I get the following error message:
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme '[local]' not implemented (file: '/content/driving-object-detection/training/train')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/tpu_strategy.py", line 540, in async_wait
context.async_wait()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 2319, in async_wait
context().sync_executors()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/context.py", line 658, in sync_executors
pywrap_tfe.TFE_ContextSyncExecutors(self._context_handle)
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme '[local]' not implemented (file: '/content/driving-object-detection/training/train')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
2020-10-23 15:53:03.698253: W ./tensorflow/core/distributed_runtime/eager/destroy_tensor_handle_node.h:57] Ignoring an error encountered when deleting remote tensors handles: Invalid argument: Unable to find the relevant tensor remote_handle: Op ID: 16039, Output num: 1
Additional GRPC error information from remote target /job:worker/replica:0/task:0:
:{"created":"#1603468383.693602692","description":"Error received from peer ipv4:10.72.50.114:8470","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Unable to find the relevant tensor remote_handle: Op ID: 16039, Output num: 1","grpc_status":3}
What additional steps am I supposed to take to prepare the files for the TPU? As I mentioned, for the dataset and folder structure I followed tensorflow.org, and everything works well with the GPU.
It is apparently not as simple as adding --use_tpu=true. Is there a step-by-step guide, or can anyone shed some light?
Here is a guide: https://www.tensorflow.org/guide/tpu.
TPUs can only access datasets in GCS buckets.
Alternatively, you can manage the files yourself: load the local files and then create the dataset using the tf.data.Dataset.from_tensor_slices method.
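A minimal sketch of that workaround, assuming TF 2.x and placeholder file names and labels:

import tensorflow as tf

# Read local files into host memory; the TPU then receives the bytes as
# tensors and never touches the '[local]' file system.
filenames = ["img_0.jpg", "img_1.jpg"]  # placeholder local paths
images = [open(name, "rb").read() for name in filenames]
labels = [0, 1]  # placeholder labels

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset = dataset.map(lambda img, label: (tf.io.decode_jpeg(img), label))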
OK, I learned that all the input files have to be in a Google Cloud Storage bucket. If your path does not start with gs://, the TPU will not work.

Universal Sentence Encoder load error "Error: SavedModel file does not exist at..."

I installed the Universal Sentence Encoder (TensorFlow 2) in two virtual environments with Anaconda. One is on Mac, the other on Ubuntu.
Both worked with the following:
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
Installed with:
conda create -n my-tf2-env python=3.6 tensorflow
conda init bash
conda activate my-tf2-env
conda install -c conda-forge tensorflow-hub
But, for some unknown reason, after 3 weeks the Mac environment stopped working, failing at the following line with this error:
model = hub.load(module_url)
Error: SavedModel file does not exist at: /var/folders/99/8rwn_9hx3jj9x3qz6yf0j2f00000gp/T/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/{saved_model.pbtxt|saved_model.pb}
On the Mac, I recreated a new environment with the same procedure, but it has the same error.
On Ubuntu, everything works well.
I want to know how to fix this on the Mac. Thank you for your help.
What I attempted on the Mac: I tried to download "https://tfhub.dev/google/universal-sentence-encoder/4" to the local drive, to load it from there in the future instead of from the web URL. This process was never finished or successful. I don't remember whether that attempt left anything downloaded on the Mac that might have corrupted tensorflow_hub under my login account.
This error usually occurs when the saved_model.pb is not present in the path specified in the module_url.
For example, suppose the folder /home/mothukuru/Downloads/Hub contains the extracted model files (saved_model.pb, together with its variables and assets folders).
The code,
import tensorflow_hub as hub
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
and
import tensorflow_hub as hub
module_url = "/home/mothukuru/Downloads/Hub"
model = hub.load(module_url)
work successfully.
But if saved_model.pb is not present in that folder,
Executing the code,
import tensorflow_hub as hub
module_url = "/home/mothukuru/Downloads/Hub"
model = hub.load(module_url)
results in the below error,
OSError: SavedModel file does not exist at: /home/mothukuru/Downloads/Hub/{saved_model.pbtxt|saved_model.pb}
In your specific case, executing the code while the download of the model was still in progress might have resulted in the error.
As stated in the comment, deleting the downloaded file can fix the problem.
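In case it helps, a minimal sketch of clearing that cache from Python, assuming the default cache location in the system temp directory (the /var/folders/.../T/tfhub_modules path in the error above):

import os
import shutil
import tempfile

# tensorflow_hub caches downloads under <TMPDIR>/tfhub_modules by default;
# removing it forces a clean re-download on the next hub.load call
cache_dir = os.path.join(tempfile.gettempdir(), "tfhub_modules")
shutil.rmtree(cache_dir, ignore_errors=True)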
Please let me know if this answer has not resolved your issue and I will be happy to modify it accordingly.
TF published some additional guidelines on caching models, apparently in response to questions about this issue.
In my case, I was running this locally on a Mac via a Jupyter notebook.
I was not sure how to "delete the downloaded file" as suggested in the other answer, but I found that this resolved my issue:
https://www.tensorflow.org/hub/caching#reading_from_remote_storage
Reading from remote storage
Users can instruct the tensorflow_hub library to directly read models from remote storage (GCS) instead of downloading the models locally with
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"
or by setting the command-line flag --tfhub_model_load_format to UNCOMPRESSED. This way, no caching directory is needed, which is especially helpful in environments that provide little disk space but a fast internet connection.
I ran that command in my notebook, and then the error was immediately resolved.
Note: I assume this is slower, especially if you do not have a fast internet connection, since what you are doing is telling the program to not locally cache (store) a copy and to just download it on demand.
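Put together, the notebook cell looks roughly like this (a sketch; the environment variable just needs to be set before hub.load runs):

import os

# must be set before tensorflow_hub resolves the module URL
os.environ["TFHUB_MODEL_LOAD_FORMAT"] = "UNCOMPRESSED"

import tensorflow_hub as hub

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)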

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/@natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1 \
--module-name=trainer.cnn_with_keras \
--package-path=./trainer \
--job-dir=gs://mykerasstorage \
--region=europe-north1 \
--config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors. First, the cloudml-gpu.yaml file can't be found ("no such folder or file"), and when I just remove that flag, I get errors saying the __init__.py file is missing, but it isn't, even if it is empty (which it was when I downloaded it from the tutorial's GitHub). I am guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this, or where to write the commands: in my terminal with the gcloud command, or in the Cloud Shell in the browser? And how do I define the path where my Python files are located?
I should mention that I am working on a Mac, and I'm pretty new to using Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will go through all the steps, even though you are already halfway there, as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands: it depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands, and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
    ├── __init__.py
    ├── cloudml-gpu.yaml
    └── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in TensorFlow 1.5. If you have specified an earlier version, this error is to be expected. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages

setup(name='trainer',
      version='0.1',
      packages=find_packages(),
      description='Example on how to run keras on gcloud ml-engine',
      author='Username',
      author_email='user@gmail.com',
      install_requires=[
          'keras==2.1.5',
          'h5py'
      ],
      zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't, the latest version is used, which has compatibility issues with versions of TensorFlow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved in the model/ folder as an .h5 file. Keep in mind that if you set a specific folder for the destination, you should include the '/' at the end. For example, set gs://my-bucket/output/ instead of gs://my-bucket/output. If you do the latter, you will end up with folders named output, outputlogs and outputmodel; inside output there should be packages. The job page link should direct you to the output folder, so make sure to check the rest of the bucket too!
In addition, on the AI-Platform job page you should be able to see information regarding CPU, GPU and network utilization.
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, whether it is your personal Mac or the Cloud Shell, has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in the Cloud Shell it is, of course, already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
I hope that you will find my answer helpful.
I got it to work halfway now by not uploading the files, but just running the gcloud commands from my local terminal. However, there was an error while the job was running, ending in "job failed".
It seems it was trying to import something from the TensorFlow backend, "from tensorflow.python.eager import context", but there was an ImportError: No module named eager.
I have tried "pip install tf-nightly", which was suggested elsewhere, but it says I don't have permission, or I lose the connection to Cloud Shell (exactly when I try to run the command).
I have also tried making a virtual environment locally to match the one on gcloud: a Conda environment with Python 3.5, TensorFlow 1.14.0 and Keras 2.2.5, which should be supported by gcloud.
The Python program works fine in this environment locally, but I still get the ImportError: No module named eager when trying to run the job on gcloud.
I am passing the flag --python-version 3.5 when submitting the job, but when I run "python -V" in the Google Cloud Shell, it says Python 2.7. Could this be the issue? I have not found a way to update the Python version from the Cloud Shell prompt, but Google Cloud supposedly supports Python 3.5. If this is indeed the issue, any suggestions on how to upgrade the Python version on Google Cloud?
It is also possible to manually start a new job in the Google Cloud web interface. Doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py, where cnn_with_keras.py is my Python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now; it was something as simple as my Google Cloud account having GPU settings disabled and needing to be upgraded.

Google Cloud ML Engine: Trouble With Local Prediction Given saved_model.pb

I've trained a Keras model using the tf.data.Dataset API and am trying to see if I've saved it (as saved_model.pb) correctly, so I can use it on ML Engine. Here's what I've done:
estimator = tf.keras.estimator.model_to_estimator(my_model)
# create serving function...
estimator.export_savedmodel('./export', serving_fn)
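For context, the serving function here is typically a serving_input_receiver_fn; a hypothetical sketch, with a made-up input name and shape that would have to match the Keras model's input layer:

import tensorflow as tf  # assuming TF 1.x, matching the ml-engine era

def serving_fn():
    # hypothetical input name/shape; check my_model.input_names for the
    # real name and use your model's actual input shape
    inputs = {'input_1': tf.placeholder(tf.float32, shape=[None, 28, 28, 1])}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)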
So now I'm trying to use gcloud ml-engine local predict to see if I can get a prediction back. I'm doing:
gcloud ml-engine local predict --model-dir=~/path/to/folder --json-instances=instances.json
Unfortunately, I get:
cloud.ml.prediction.prediction_utils.PredictionError: Failed to load model: Cloud ML only supports TF 1.0 or above and models saved in SavedModel format. (Error code: 0)
So then I try adding --runtime-version=1.2 to my command like this:
gcloud ml-engine local predict --model-dir=~/path/to/folder --json-instances=instances.json --runtime-version=1.2
and I get back:
ERROR: (gcloud.ml-engine.local.predict) unrecognized arguments: --runtime-version=1.2
Any idea what I'm doing incorrectly / how to fix?
Thanks!
For posterity: the problem turned out to be an incorrect path. If anybody else encounters this issue, try using the full absolute path, and ensure you are pointing to the directory containing the saved_model.pb file.
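A quick way to sanity-check the path before running the command (a sketch; ~/path/to/folder is the placeholder from the question):

import os

# expand ~ and relative segments yourself; gcloud may not do it for you
model_dir = os.path.abspath(os.path.expanduser("~/path/to/folder"))
print(model_dir)  # pass this value to --model-dir
print(os.path.exists(os.path.join(model_dir, "saved_model.pb")))  # should be True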

How to allow soft device placement when deploying a TensorFlow model to GCP?

I am trying to deploy a TensorFlow model to GCP's Cloud Machine Learning Engine for prediction, but I get the following error:
$> gcloud ml-engine versions create v1 --model $MODEL_NAME --origin $MODEL_BINARIES --runtime-version 1.9
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Invalid argument: Cannot assign a device for operation 'tartarus/dense_2/bias': Operation was explicitly assigned to /device:GPU:3 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.\n\t [[Node: tartarus/dense_2/bias = VariableV2[_class=[\"loc:@tartarus/dense_2/bias\"], _output_shapes=[[200]], container=\"\", dtype=DT_FLOAT, shape=[200], shared_name=\"\", _device=\"/device:GPU:3\"]()]]\n\n (Error code: 0)
My model was trained on several GPUs, and it seems like the default machines on CMLE don't support GPU for prediction, hence the error I get. So, I am wondering if the following is possible:
Set the allow_soft_placement var to True, so that CMLE can use the CPU instead of the GPU for a given model.
Activate GPU prediction on CMLE for a given model.
If not, how can I deploy a TF model trained on GPUs to CMLE for prediction? It feels like this should be a straightforward feature to use, but I can't find any documentation about it.
Thanks!
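For context, allow_soft_placement is a tf.Session configuration option in TF 1.x; a minimal sketch of the behavior in question, run locally rather than on CMLE:

import tensorflow as tf  # TF 1.x

# allow_soft_placement lets TF fall back to an available device (e.g. CPU)
# when an op was pinned to a device that does not exist on this machine
config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    pass  # load and run the graph here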
I've never used gcloud ml-engine versions create, but when you deploy a training job with gcloud ml-engine jobs submit training, you can add a config flag that identifies a configuration file.
This file lets you identify the target machine for training, and you can use multiple CPUs and GPUs. The documentation for the configuration file is here.