FinalExporter not working in TensorFlow 2.1 on Google AI-Platform - tensorflow

I'm trying to upgrade my model to use AI-Platform 2.1 instead of 1.15, but I can't get the FinalExporter to work.
I followed the steps outlined in "ai-platform: No eval folder or export folder in outputs when running TensorFlow 2.1 training job using Estimators" and I've gotten it to a place where:
The evaluation metrics are exported to the eval folder
Both the BestExporter and LatestExporter are successfully exporting the model
The FinalExporter does not export any model
The code I'm using for this is similar to:
import tensorflow as tf
...
estimator = tf.estimator.Estimator(...)
train_spec = tf.estimator.TrainSpec(...)
final_exporter = tf.estimator.FinalExporter("final", ...)
latest_exporter = tf.estimator.LatestExporter("latest", ...)
best_exporter = tf.estimator.BestExporter("best", ...)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    exporters=[latest_exporter, final_exporter, best_exporter],
    name="eval",
)
tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
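(For reference, each exporter receives a serving_input_receiver_fn along these lines; the feature spec below is just a placeholder for illustration, not my real model's inputs:)
def serving_input_fn():
    # Placeholder feature spec purely for illustration.
    features = {"x": tf.compat.v1.placeholder(tf.float32, shape=[None, 10], name="x")}
    return tf.estimator.export.ServingInputReceiver(features, features)

final_exporter = tf.estimator.FinalExporter("final", serving_input_receiver_fn=serving_input_fn)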
I'm also using the following YAML config file:
trainingInput:
  runtimeVersion: "2.1"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: standard_v100
  evaluatorType: standard_gpu
  evaluatorCount: 1
The problem seems to be that the model is no longer being evaluated after the final training step. This can be seen in TensorBoard: in 1.15 the final evaluation runs after the last training metrics are exported, but this is no longer the case in 2.1.
TensorBoard comparing the steps for which the last losses were recorded.
Logs
The problem with the model no longer being evaluated after the final training step is supported by the logs:
2020-10-27 09:06:03.504 EDT
master-replica-0
"Saving checkpoints for 77872 into ...
...
2020-10-27 09:06:19.033 EDT
evaluator-replica-0
"Calling model_fn."
...
2020-10-27 09:06:20.093 EDT
master-replica-0
"Loss for final step: 50.796585."
...
2020-10-27 09:06:28.005 EDT
service
Tearing down training program.
...
2020-10-27 09:06:28.430 EDT
evaluator-replica-0
"Terminated by service. If the job is supposed to continue running, it will be restarted on other VM shortly."
which indicates that evaluator-replica-0 is shut down as soon as training has finished, while it is still in the middle of evaluating.
Is this a bug in TF/AI-Platform 2.1 or do I have to do something differently to ensure that the evaluator evaluates the model (and exports it) after the final training step?
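For completeness, an untested workaround I'm considering is to export explicitly on the chief once train_and_evaluate returns (serving_input_fn here stands in for whatever serving_input_receiver_fn the exporters were given, and the export path is a placeholder):
tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
# Untested: run one more evaluation pass and an explicit export after training finishes.
estimator.evaluate(input_fn=eval_input_fn, name="eval")
estimator.export_saved_model("gs://your-bucket/model/export/final",  # placeholder export path
                             serving_input_fn)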

Related

Incorrect freezing of weights maskrcnn Tensorflow 2 in object_detection_API

I am training the Mask R-CNN Inception ResNet v2 model with TensorFlow 2 for further work with OpenVINO. After training the model, I freeze it using a script in the object_detection_API directory:
python exporter_main_v2.py \
    --trained_checkpoint_dir training \
    --output_directory inference_graph \
    --pipeline_config_path training/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8.config
After running this script, I get the saved model and pipeline files, which should be used in OpenVINO later.
The following error occurs when loading the resulting files into the Model Optimizer:
Model Optimizer version:
2020-08-20 11:37:05.425293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
[ FRAMEWORK ERROR ] Cannot load input model: TensorFlow cannot read the model file: "C:\Users\Anna\Downloads\inference_graph\inference_graph\saved_model\saved_model.pb" is incorrect TensorFlow model file.
The file should contain one of the following TensorFlow graphs:
frozen graph in text or binary format
inference graph for freezing with checkpoint (--input_checkpoint) in text or binary format
meta graph
Make sure that --input_model_is_text is provided for a model in text format. By default, a model is interpreted in binary format. Framework error details: Error parsing message.
For more information please refer to Model Optimizer FAQ (https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_Model_Optimizer_FAQ.html), question #43.
I trained the model by following the example from the linked article, using my own dataset: https://gilberttanner.com/blog/train-a-mask-r-cnn-model-with-the-tensorflow-object-detection-api
On GPU the model starts and works, but I need the converted model for OpenVINO.
Run the mo_tf.py script with a path to the SavedModel directory:
python3 mo_tf.py --saved_model_dir <SAVED_MODEL_DIRECTORY>
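As a quick sanity check before running the Model Optimizer (TF 2.x; the path below is a placeholder for your export directory), you can verify that the SavedModel is readable and has a serving signature:
import tensorflow as tf

loaded = tf.saved_model.load("inference_graph/saved_model")  # placeholder path to the exported SavedModel
print(list(loaded.signatures.keys()))  # expect something like ['serving_default'] if the export succeeded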

Tensorflow Object Detection API: Training gets stuck at step=0 for ssd + mobilenetv2 with custom data

I wanted to do transfer learning using an ssd + mobilenetv2 model with my own images. I have only one class. The images were downloaded from OpenImageDataSet. I used TensorFlow's Object Detection API, but the training gets stuck at step = 0.
I verified that the TFRecord was correctly created, as I can use the same data to train faster_rcnn with the Object Detection API. I created my own config file using the one in the repo: ssd_mobilenet_v2_oid_v4.config.
I also tried to start with ssd_mobilenet_v2_coco_2018_03_29.tar.gz using the corresponding config file. The behavior is the same -- it also gets stuck at the same place.
####################
CONSOLE LOG:
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0416 16:30:39.198738 19792 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0416 16:30:39.632495 19792 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
I0416 16:30:48.724722 19792 basic_session_run_hooks.py:606] Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
2020-04-16 16:30:59.919297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 16:31:00.964680: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 16:31:00.986098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 12.512502, step = 0
I0416 16:31:02.740392 19792 basic_session_run_hooks.py:262] loss = 12.512502, step = 0 [STUCK HERE]
Are you sure it is stuck? Do you get any errors?
During the training process, the TF OD API writes logs into an event file (which can be opened using TensorBoard) in the model directory.
Look in your model directory and see if there is an event file written there, and check its timestamp to see whether it is being updated.
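A quick sketch of that check (the model directory is taken from the console log above; the event files follow the usual events.out.tfevents.* naming):
import glob, os, time

model_dir = r"D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2"  # model directory from the log above
for path in glob.glob(os.path.join(model_dir, "events.out.tfevents.*")):
    age = time.time() - os.path.getmtime(path)
    print(f"{path}: last modified {age:.0f} seconds ago")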
I found out that the combination of the TF 1.15 GPU version and my setup causes the problem: "Invoking ptxas not supported on Windows". Downgrading to TF 1.14 GPU or using TF 1.15 CPU solves the issue. It is a common and open issue on TensorFlow: HERE

How can I train with my own dataset with darkflow?

I'm a beginner with some programming experience. I'm trying to train darkflow with my own dataset. I'm following these instructions.
https://github.com/thtrieu/darkflow
So far I have done the following steps.
installed darkflow and the relevant modules
created test images and made annotations (Pascal VOC).
https://ibb.co/y4HmtGz
https://ibb.co/GkxLshK
If I have understood correctly, darkflow training requires Pascal VOC annotations?
My problem is that I don't know how to start the training. How can I start the training process, and how can I test if the neural net is working? Am I supposed to get weights as a result of training?
You can choose to use pre-trained weights from here. Download the cfg and weights files.
Assuming you have darkflow installed, you can train your network like this:
flow --model cfg/<your-config-filename>.cfg --load bin/<filename>.weights --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
If you want to train your network from scratch w/o using any pre-trained weights,
you can do this:
flow --model cfg/<your-config-filename>.cfg --train --annotation train/Annotations --dataset train/Images --epoch 100 --gpu 1.0
After the training starts, model checkpoints are saved inside the ckpt directory. You can load the latest checkpoint and test on sample images.
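For example, a minimal sketch using darkflow's Python API (the cfg filename and image path are placeholders; "load": -1 is meant to pick the most recent checkpoint from the ckpt directory):
import cv2
from darkflow.net.build import TFNet

# Placeholder options: point "model" at the same cfg you trained with.
options = {"model": "cfg/your-config-filename.cfg", "load": -1, "threshold": 0.3, "gpu": 1.0}
tfnet = TFNet(options)  # restores the latest checkpoint from ckpt/
result = tfnet.return_predict(cv2.imread("sample_img/test.jpg"))  # list of {label, confidence, topleft, bottomright}
print(result)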

Tensorflow Models - Object Detection how to specify epochs details?

I am training on the Pascal dataset for object detection on my laptop, and I get the output "Skipping training since max_steps has already saved". Stepping a level lower, I could see that the generated pipeline file has the number of epochs set to 1. I am looking for how to specify the number of epochs for the training model.
Command used -
$ python object_detection/model_main.py --pipeline_config_path=/home/sshetty/github/lab/pascal/models/ssdlite_mobilenet_v2_coco/ssd_mobilenet_v2_coco.config --model_dir=/home/sshetty/github/lab/pascal/models/ssdlite_mobilenet_v2_coco/ --num_train_steps=50000 --alsologtostderr --verbosity=1 --stderrthreshold=debug
Output -
I0124 17:29:36.659717 139972583880448 tf_logging.py:115] Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps None or save_checkpoints_secs 600.
INFO:tensorflow:Skipping training since max_steps has already saved.
I0124 17:29:36.662388 139972583880448 tf_logging.py:115] Skipping training since max_steps has already saved.

Using model optimizer for tensorflow slim models

I am aiming to run inference on a TensorFlow Slim model with Intel OpenVINO, using the OpenVINO docs and slides for inference and the TF Slim docs for training the model.
It's a multi-class classification problem. I have trained a TF Slim mobilenet_v2 model from scratch (using the script train_image_classifier.py). Evaluation of the trained model on the test set gives relatively good results to begin with (using the script eval_image_classifier.py):
eval/Accuracy[0.8017]
eval/Recall_5[0.9993]
However, a single .ckpt file is not saved (even though at the end of the train_image_classifier.py run there is a message like "model.ckpt is saved to checkpoint_dir"); there are 3 files (.ckpt-180000.data-00000-of-00001, .ckpt-180000.index, .ckpt-180000.meta) instead.
OpenVINO model optimizer requires a single checkpoint file.
According to docs I call mo_tf.py with following params:
python mo_tf.py --input_model D:/model/mobilenet_v2_224.pb --input_checkpoint D:/model/model.ckpt-180000 -b 1
It gives the error (same if pass --input_checkpoint D:/model/model.ckpt):
[ ERROR ] The value for command line parameter "input_checkpoint" must be existing file/directory, but "D:/model/model.ckpt-180000" does not exist.
The error message is clear: there are no such files on disk. But as far as I know, most TF utilities convert .ckpt-????.meta to .ckpt under the hood.
Trying to call:
python mo_tf.py --input_model D:/model/mobilenet_v2_224.pb --input_meta_graph D:/model/model.ckpt-180000.meta -b 1
Causes:
[ ERROR ] Unknown configuration of input model parameters
It doesn't matter for me in which way I will transfer graph to OpenVINO intermediate representation, just need to reach that result.
Thanks a lot.
EDIT
I managed to run the OpenVINO Model Optimizer on a frozen graph of the TF Slim model. However, I still have no idea why my previous attempts (based on the docs) failed.
You can try converting the model to the frozen format (.pb) and then convert the model using OpenVINO.
The .ckpt-meta file has the metagraph: the computation graph structure without variable values (the one you can observe in TensorBoard).
The .ckpt-data file has the variable values, without the skeleton or structure. To restore a model we need both the meta and data files.
The .pb file saves the whole graph (meta + data).
As per the documentation of OpenVINO:
When a network is defined in Python* code, you have to create an inference graph file. Usually, graphs are built in a form that allows model training. That means that all trainable parameters are represented as variables in the graph. To use the graph with the Model Optimizer, it should be frozen.
https://software.intel.com/en-us/articles/OpenVINO-Using-TensorFlow
OpenVINO optimizes the model by converting the weighted graph passed in frozen form.
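For reference, a minimal freezing sketch using the TF 1.x graph utilities (the checkpoint prefix matches the one above; the output node name is an assumption for a Slim MobileNet V2 and may differ for your model):
import tensorflow as tf  # TF 1.x

with tf.Session() as sess:
    # Restore the graph structure from the .meta file and the weights from the checkpoint prefix.
    saver = tf.train.import_meta_graph("D:/model/model.ckpt-180000.meta")
    saver.restore(sess, "D:/model/model.ckpt-180000")
    # Fold variables into constants; the output node name depends on your model.
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["MobilenetV2/Predictions/Reshape_1"])
    with tf.gfile.GFile("D:/model/frozen_mobilenet_v2.pb", "wb") as f:
        f.write(frozen.SerializeToString())
The resulting .pb can then be passed to mo_tf.py with --input_model.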