TensorFlow 1.0.1 SavedModelBuilder

I'm currently exploring how to deploy models on Google ML Engine. At first, I developed a model using TensorFlow 1.1.0, as it was the latest version available (at the time this question was asked). However, it turned out that the highest version of TensorFlow supported on GCP is 1.0.1.
The problem is that with TensorFlow 1.1.0, SavedModelBuilder correctly saved the model as a SavedModel, with its variables under the variables/ directory. When I switched to TensorFlow 1.0.1, however, it behaved differently: the SavedModel file was created, but no files were written under variables/, so no model can be reconstructed from the SavedModel file alone (the files under variables/ are missing).
Is this a known bug? Or do I need to do something to make SavedModelBuilder on TensorFlow 1.0.1 behave the way it does on TensorFlow 1.1.0?
Thank you.
EDIT, more detail:
Actually, there are no explicit tf.Variables in my model. There are, however, several tf.contrib.lookup.MutableDenseHashTables, and they are exported correctly in TensorFlow 1.1.0 but not in TensorFlow 1.0.1 (where no variables are exported at all).
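For illustration, a minimal sketch of this kind of table (dtypes and values here are assumed, not from my actual model):
import tensorflow as tf

# A stateful lookup op: it holds data like a variable, but is not a tf.Variable.
table = tf.contrib.lookup.MutableDenseHashTable(
    key_dtype=tf.int64,
    value_dtype=tf.int64,
    default_value=-1,  # returned for missing keys
    empty_key=-2)      # sentinel key that must never be inserted

keys = tf.constant([1, 2], dtype=tf.int64)
values = tf.constant([10, 20], dtype=tf.int64)
insert_op = table.insert(keys, values)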

It looks like the ability to save and load models in TensorFlow without variables was introduced in this commit, which is only available in 1.1.0.
As a workaround, you could create a dummy (unused) variable in your model.
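A minimal sketch of that workaround (the variable name and export path are placeholders, assuming the TF 1.x API):
import tensorflow as tf

# Build the model as usual, then add one unused variable; its presence
# nudges SavedModelBuilder on TF 1.0.1 into writing the variables/ files.
dummy = tf.Variable(0, name='dummy_unused', trainable=False)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    builder = tf.saved_model.builder.SavedModelBuilder('/tmp/export')
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING])
    builder.save()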
Edit:
Based on the OP's update, it sounds like there is a MutableDenseHashTable that isn't being saved out.
You can run TensorFlow 1.1 on CloudML Engine, but it requires manually adding it as an additional package.
First, download the TensorFlow 1.1 wheel. Then specify it as an additional package for your training job, e.g.:
gcloud ml-engine jobs submit training my_job \
--module-name trainer.task \
--staging-bucket gs://my-bucket \
--package-path /my/code/path/trainer \
--packages tensorflow-1.1.0-cp27-cp27mu-manylinux1_x86_64.whl
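To confirm the wheel was actually picked up, a quick check at the top of the trainer can help (my suggestion, not part of the original recipe):
import tensorflow as tf

# Fail fast if the job is still running the default TF 1.0.1 runtime.
assert tf.__version__.startswith('1.1'), 'unexpected TF version: %s' % tf.__version__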

Related

Training using object detection api is not running on GPUs in AI Platform

I am trying to train some models with the TensorFlow 2 Object Detection API.
I am using this command:
gcloud ai-platform jobs submit training segmentation_maskrcnn_`date +%m_%d_%Y_%H_%M_%S` \
--runtime-version 2.1 \
--python-version 3.7 \
--job-dir=gs://${MODEL_DIR} \
--package-path ./object_detection \
--module-name object_detection.model_main_tf2 \
--region us-central1 \
--scale-tier CUSTOM \
--master-machine-type n1-highcpu-32 \
--master-accelerator count=4,type=nvidia-tesla-p100 \
-- \
--model_dir=gs://${MODEL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
The training job is submitted successfully, but when I look at it on AI Platform I notice that it's not using the GPUs!
Also, when looking at the logs for my training job, I noticed that in some cases it couldn't load the CUDA libraries. It would say something like this:
Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/nvidia/lib64
I was using AI platform for training a few months back and it was successful. I don't know what has changed now!
In fact, for my own setup, nothing has changed.
For the record, I am training Mask RCNN now. A few months back I trained Faster RCNN and SSD models.
I'm not sure, as I couldn't test this myself. A quick Google search shows that people have hit the libcudart.so.11.0 error for many different reasons, and the solution depends on the cause. The same question has been asked on SO before, which you may have missed; check it first, here.
Also, check these related issues:
TensorFlow Issue #26182
TensorFlow Issue #45930
TensorFlow Issue #38578
If the issue still remains after you've tried every possible solution, update your question with what you found.
I think there is a mismatch between your CUDA setup (CUDA, cuDNN) and your TF version; check those first in your working environment. Also, make sure the CUDA path is set correctly. According to the error message, you need to ensure the following is set properly:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.0/lib64/
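As a quick runtime sanity check (a suggestion on my part, not taken from your logs), you can also ask TensorFlow 2.x directly whether it sees the GPUs:
import tensorflow as tf

print(tf.__version__)
# An empty list here means the CUDA libraries did not load.
print(tf.config.list_physical_devices('GPU'))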

pickle.load cannot open up a (Stylegan2 network) pickle model on my local machine, but can on the cloud

Stylegan2 uses network pickle files to store ML models. I transfer-trained one model, which I am able to open on cloud servers. I have been generating images from this model just fine with the following setup:
Google Colab: Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
However, I cannot open the network pickle file on my local machine, even though I've been trying to replicate that cloud setup the best I can. (I have the right GPU hardware/drivers/etc.)
Local (Windows 10): Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
Opening the pickle requires the 'dnnlib' library to be on the PYTHONPATH and a tf.Session() to be initialized.
I get an assertion error about the pickle:
**Assertion error**: `assert state["version"] in [2,3]`
I find this error very odd because the network pickle works on the cloud, so it was saved properly. Additionally, my local setup can open other network pickles (i.e., ones downloaded from the internet through GET requests), which makes me think that I have properly set up my PYTHONPATH and initialized a tf.Session, the prerequisites listed in the Stylegan repo:
"You can import the networks in your own Python code using pickle.load(). For this to work, you need to include the dnnlib source directory in PYTHONPATH and create a default TensorFlow session by calling dnnlib.tflib.init_tf()"
I'm not sure why I cannot open up this pickle in one environment, but can in another. Does anyone have any suggestions as to where I might start looking?
Actually, I figured it out by printing out which version was throwing the error. The version printed was '4'. I realized that this matched pickle.HIGHEST_PROTOCOL, and that what I needed was the newest pull of the Stylegan2 repository, which includes format_version 4 among the allowed versions.
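A self-contained illustration of that version gate (the real check lives in Stylegan2's network unpickling code; the state dict here is mocked):
state = {'version': 4}  # what the debugging print revealed

old_allowed = [2, 3]     # older checkout: the assert fails for version 4
new_allowed = [2, 3, 4]  # latest checkout: format_version 4 is accepted

print('pickled format version:', state['version'])
assert state['version'] in new_allowed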

tensorboard: cannot open subgraph

Recently I wrote a custom operator (and its gradient) in Python, following this post:
Tensorflow: Custom operation used in two networks simultaneously produces nan
TensorFlow runs with no errors and the predictions give the expected accuracy. However, when I visualize the graph with TensorBoard, I find that I cannot open the subgraph to see its structure, although its gradient subgraph can be opened and inspected. Does anyone have an idea about this problem?
Fig.1: subgraph fc1 cannot be opened but gradient/fc1 can be opened.
I can open the fc1 metanode on TensorBoard 0.1.8.
What version of TensorBoard are you using? You can find the version by running
python -c 'from tensorboard import version; print(version.VERSION)'
After that, could you try upgrading tensorboard via
pip install --upgrade tensorflow-tensorboard
and let me know if the issue persists?

How can I serve the Faster RCNN with Resnet 101 model with tensorflow serving

I am trying to serve the Faster RCNN with Resnet 101 model with tensorflow serving.
I know I need to use tf.saved_model.builder.SavedModelBuilder to export the model definition as well as the variables, and then I need a client script like the inception_client.py provided by tensorflow_serving.
While I am going through the examples and documentation and experimenting, I think someone may have done the same thing before. So please help if you have done this or know how to get it done. Thanks in advance.
Tensorflow Object Detection API has its own exporter script that is more sophisticated than the outdated examples found under Tensorflow Serving.
While building Tensorflow Serving, make sure you pull the latest master commits of tensorflow/tensorflow (>r1.2) and tensorflow/models.
Build Tensorflow Serving for GPU
bazel build -c opt --config=cuda tensorflow_serving/...
If you face errors regarding crosstool and nccl, follow the solutions at
https://github.com/tensorflow/serving/issues/186#issuecomment-251152755
https://github.com/tensorflow/serving/issues/327#issuecomment-305771708
Usage
python tf_models/object_detection/export_inference_graph.py \
--pipeline_config_path=/path/to/ssd_inception_v2.config \
--trained_checkpoint_prefix=/path/to/trained/checkpoint/model.ckpt \
--output_directory /path/to/output/1 \
--export_as_saved_model \
--input_type=image_tensor
Note that during export, all variables are converted into constants and baked into the protobuf binary. Don't panic if you don't find any files under the saved_model/variables directory.
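To verify that the export is indeed self-contained, you can load it back with the TF 1.x loader (a suggested check; the exact path under --output_directory is assumed):
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(
        sess, [tf.saved_model.tag_constants.SERVING],
        '/path/to/output/1/saved_model')
    print('SavedModel loaded OK')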
To start the server,
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=9000 --model_name=inception_v2 --model_base_path=/path/to/output --enable_batching=true
As for the client, the examples under Tensorflow Serving work well.
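For reference, a stripped-down client in the style of those examples (host, port, model name, and the 'inputs' key are assumptions tied to the commands above):
import numpy as np
from PIL import Image
import tensorflow as tf
from grpc.beta import implementations
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

channel = implementations.insecure_channel('localhost', 9000)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'inception_v2'  # must match --model_name above

# --input_type=image_tensor expects a batched uint8 image tensor.
image = np.array(Image.open('/path/to/image.jpg'), dtype=np.uint8)
request.inputs['inputs'].CopyFrom(
    tf.contrib.util.make_tensor_proto(image[None, ...]))

result = stub.Predict(request, 10.0)  # 10-second timeout
print(result)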

Does a "tiny" Tensorflow / Keras exist which can be executed with AWS Lambda?

The Python TensorFlow package is huge, and AWS Lambda allows only 250 MB, into which you have to fit all resources in a zip file, including all dependencies.
Is it possible to have a "minified" Tensorflow / Keras?
Yes, it is possible by using the Serverless framework with serverless-ephemeral.
A specific TensorFlow packager is included.