Tensorflow Object Detection API: Training gets stuck at step=0 for ssd + mobilenetv2 with custom data

I wanted to do transfer learning with an SSD + MobileNetV2 model using my own images. I have only one class. The images were downloaded from the Open Images Dataset, and I used TensorFlow's Object Detection API. But the training gets stuck at step = 0.
I verified that the TFRecord was created correctly, as I can use the same data to train faster_rcnn with the Object Detection API. I created my own config file based on the one in the repo: ssd_mobilenet_v2_oid_v4.config.
I also tried starting from ssd_mobilenet_v2_coco_2018_03_29.tar.gz with the corresponding config file. The behavior is the same -- it also gets stuck at the same place.
####################
CONSOLE LOG:
Instructions for updating:
Use standard file utilities to get mtimes.
INFO:tensorflow:Running local_init_op.
I0416 16:30:39.198738 19792 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0416 16:30:39.632495 19792 session_manager.py:502] Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
I0416 16:30:48.724722 19792 basic_session_run_hooks.py:606] Saving checkpoints for 0 into D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2\model.ckpt.
2020-04-16 16:30:59.919297: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-04-16 16:31:00.964680: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2020-04-16 16:31:00.986098: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
INFO:tensorflow:loss = 12.512502, step = 0
I0416 16:31:02.740392 19792 basic_session_run_hooks.py:262] loss = 12.512502, step = 0 [STUCK HERE]

Are you sure it is stuck? Do you get any errors?
During the training process, the TF OD API writes logs into an event file (which can be opened with TensorBoard) in the model directory.
Look in your model directory and check whether an event file has been written there; look at its timestamp to see if it is being updated.
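For example, a quick way to check this from Python (a minimal sketch; adjust model_dir to your own model directory, here taken from the log in the question):
import os
import time

model_dir = r"D:\work\cv\others\my-tf2-od-transfer-ssd-mobilenet-v2"  # your model directory
for name in os.listdir(model_dir):
    # TF event files are named events.out.tfevents.*
    if name.startswith("events.out.tfevents"):
        path = os.path.join(model_dir, name)
        print(name, "last modified:", time.ctime(os.path.getmtime(path)))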

I found out that the combination of the TF 1.15 GPU build and my setup causes the problem: "Invoking ptxas not supported on Windows". Downgrading to TF 1.14 GPU or using the TF 1.15 CPU build solves the issue. It is a common and open issue on Tensorflow: HERE
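If you hit the same combination, either of these installs should work around it (assuming a standard pip setup; the tensorflow-cpu package exists from 1.15 onwards):
pip install tensorflow-gpu==1.14   # downgrade to the TF 1.14 GPU build
pip install tensorflow-cpu==1.15   # or stay on 1.15 with the CPU-only build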


Python 3.8.8 Jupyter notebook kernel dies when I call model.fit() while trying to use my GPU

My TensorFlow recognizes my GPU.
However, when I call model.fit() on my data it shows:
epoch(1/2), and then the kernel dies immediately.
If I run this in a separate virtual environment with no GPU, it works fine.
I have simplified the model architecture and reduced the number of training points to only ten as a quick test, and it still fails.
Simple example
import numpy as np
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Placeholder data; the real script loaded a dataset, and the quick test used only ten points.
X_train = np.random.rand(10, 8)
y_train = np.random.randint(0, 2, size=(10,))

model = Sequential()
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = keras.optimizers.Adam(learning_rate=.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
info = model.fit(X_train, y_train, epochs=2, batch_size=2, shuffle=True, verbose=1)
versions:
Python 3.8.8
Num GPUs Available: 1
2.5.0-dev20210227
2.4.3
CUDA v11.2
I am going to answer my own question rather than deleting this, because maybe someone else is making the same simple mistake I was.
The main mistake I made was having the incorrect CUDA download. You can check which versions are compatible at this link:
https://www.tensorflow.org/install/source#gpu
TLDR: Just follow this video:
https://www.youtube.com/watch?v=hHWkvEcDBO0
This also highlighted the importance of a virtual environment where you control the package versions to prevent incompatibilities.
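A quick way to check which CUDA and cuDNN versions your installed TensorFlow was actually built against (a small sketch; tf.sysconfig.get_build_info() is available in recent TF 2.x releases):
import tensorflow as tf

print("TF:", tf.__version__)
build = tf.sysconfig.get_build_info()
print("built for CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))
print("GPUs visible:", tf.config.list_physical_devices("GPU"))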
I had the same problem. I moved the code into a plain Python file and found the root cause. In my case the fix was copying the cuDNN DLL files into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin. Check the following link as well:
Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found

FinalExporter not working in TensorFlow 2.1 on Google AI-Platform

I'm trying to upgrade my model to use AI-Platform 2.1 instead of 1.15, but I can't get the FinalExporter to work.
I followed the steps outlined in ai-platform: No eval folder or export folder in outputs when running TensorFlow 2.1 training job using Estimators and I've gotten it to a place where:
The evaluation metrics are exported to the eval folder
Both the BestExporter and LatestExporter are successfully exporting the model
The FinalExporter does not export any model
The code I'm using for this is similar to:
import tensorflow as tf
...
estimator = tf.estimator.Estimator(...)
train_spec = tf.estimator.TrainSpec(...)
final_exporter = tf.estimator.FinalExporter("final", ...)
latest_exporter = tf.estimator.LatestExporter("latest", ...)
best_exporter = tf.estimator.BestExporter("best", ...)
eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    exporters=[latest_exporter, final_exporter, best_exporter],
    name="eval",
)
tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)
and I'm using the following YAML config file:
trainingInput:
  runtimeVersion: "2.1"
  pythonVersion: "3.7"
  scaleTier: CUSTOM
  masterType: standard_v100
  evaluatorType: standard_gpu
  evaluatorCount: 1
The problem seems to be that the model is no longer evaluated after the final training step. This can be seen in TensorBoard: in 1.15 the final eval runs after the last training metrics are exported; in 2.1 this is no longer the case.
(TensorBoard screenshot comparing the steps for which the last losses were recorded.)
Logs
The problem with the model no longer being evaluated after the final training step is supported by the logs:
2020-10-27 09:06:03.504 EDT
master-replica-0
"Saving checkpoints for 77872 into ...
...
2020-10-27 09:06:19.033 EDT
evaluator-replica-0
"Calling model_fn."
...
2020-10-27 09:06:20.093 EDT
master-replica-0
"Loss for final step: 50.796585."
...
2020-10-27 09:06:28.005 EDT service Tearing down training program.
...
2020-10-27 09:06:28.430 EDT evaluator-replica-0 "Terminated by service. If the job is supposed to continue running, it will be restarted on other VM shortly."
which indicates that evaluator-replica-0 is being shut down as soon as training has finished, while it is still in the middle of evaluating.
Is this a bug in TF/AI-Platform 2.1 or do I have to do something differently to ensure that the evaluator evaluates the model (and exports it) after the final training step?
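One possible workaround (my own sketch, not confirmed against AI-Platform; serving_input_receiver_fn and the export path are placeholders): run an explicit final evaluation and export after train_and_evaluate returns, so a final model is saved even if the evaluator replica is torn down early.
tf.estimator.train_and_evaluate(estimator=estimator, train_spec=train_spec, eval_spec=eval_spec)

# Explicit final evaluation and export, run on the chief after training ends.
metrics = estimator.evaluate(input_fn=eval_input_fn, name="final")
estimator.export_saved_model(
    "gs://my-bucket/export/final",  # hypothetical export path
    serving_input_receiver_fn)      # your serving input receiver fn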

Incorrect freezing of weights maskrcnn Tensorflow 2 in object_detection_API

I am training the Mask R-CNN Inception V2 model in TensorFlow for further work with OpenVINO. After training the model, I freeze it using a script in the object_detection_API directory:
python exporter_main_v2.py \
    --trained_checkpoint_dir training \
    --output_directory inference_graph \
    --pipeline_config_path training/mask_rcnn_inception_resnet_v2_1024x1024_coco17_gpu-8.config
After this script runs, I get the saved model and pipeline files, which should then be used in OpenVINO.
The following error occurs when loading the resulting files into the Model Optimizer:
Model Optimizer version:
2020-08-20 11:37:05.425293: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
[ FRAMEWORK ERROR ] Cannot load input model: TensorFlow cannot read the model file: "C:\Users\Anna\Downloads\inference_graph\inference_graph\saved_model\saved_model.pb" is incorrect TensorFlow model file.
The file should contain one of the following TensorFlow graphs:
frozen graph in text or binary format
inference graph for freezing with checkpoint (--input_checkpoint) in text or binary format
meta graph
Make sure that --input_model_is_text is provided for a model in text format. By default, a model is interpreted in binary format. Framework error details: Error parsing message.
For more information please refer to Model Optimizer FAQ (https://docs.openvinotoolkit.org/latest/_docs_MO_DG_prepare_model_Model_Optimizer_FAQ.html), question #43.
I trained the model following the example in the linked article, using my own dataset: https://gilberttanner.com/blog/train-a-mask-r-cnn-model-with-the-tensorflow-object-detection-api
The model starts and works on GPU, but I need the converted model for OpenVINO.
exporter_main_v2.py produces a SavedModel rather than a frozen graph, so point the Model Optimizer at the SavedModel directory instead of passing saved_model.pb as --input_model. Run the mo_tf.py script with a path to the SavedModel directory:
python3 mo_tf.py --saved_model_dir <SAVED_MODEL_DIRECTORY>
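For the paths from the error message in the question, that would look something like (adjust to your actual directory):
python3 mo_tf.py --saved_model_dir C:\Users\Anna\Downloads\inference_graph\inference_graph\saved_model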

How to prevent TPUEstimator from using GPU or TPU

I need to force TPUEstimator to use the CPU. I have a rented Google machine and the GPU is already busy with training. Since the CPUs are idle, I want to start a second TensorFlow session for evaluation, but I want to force the evaluation cycle to use CPUs only so that it does not steal GPU time.
I assume there is a flag in the run_config or similar for doing this, but I am struggling to find one in the TF documentation.
run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))
You can run a TPUEstimator locally by including two arguments: (1) use_tpu should be set to False, and (2) tf.contrib.tpu.RunConfig should be passed as the config argument.
my_tpu_estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=my_model_fn,
    config=tf.contrib.tpu.RunConfig(),
    use_tpu=False)
The majority of example TPU models can be run in local mode by setting the command line flags:
$> python mnist_tpu.py --use_tpu=false --master=''
More documentation can be found here.
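Note that use_tpu=False only switches off the TPU path; if you also need to keep the evaluation process off the GPU, as the question asks, one common approach (my addition, not part of the original answer) is to hide the GPUs from that process before TensorFlow initializes:
import os

# Must run before TensorFlow initializes CUDA, i.e. before importing tensorflow.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # hide all GPUs from this process

import tensorflow as tf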

MirroredStrategy doesn't use GPUs

I wanted to use tf.contrib.distribute.MirroredStrategy() on my multi-GPU system, but it doesn't use the GPUs for training (see the output below). I am running tensorflow-gpu 1.12.
I tried specifying the GPUs directly in the MirroredStrategy, but the same problem appeared.
model = models.Model(inputs=input, outputs=y_output)
optimizer = tf.train.AdamOptimizer(LEARNING_RATE)
model.compile(loss=lossFunc, optimizer=optimizer)

NUM_GPUS = 2
strategy = tf.contrib.distribute.MirroredStrategy(num_gpus=NUM_GPUS)
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.keras.estimator.model_to_estimator(model, config=config)
These are the results I am getting:
INFO:tensorflow:Device is available but not used by distribute strategy: /device:CPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:0
INFO:tensorflow:Device is available but not used by distribute strategy: /device:GPU:1
WARNING:tensorflow:Not all devices in DistributionStrategy are visible to TensorFlow session.
The expected result would obviously be for the training to run on the multi-GPU system. Is this a known issue?
I've been facing a similar issue with MirroredStrategy failing on tensorflow 1.13.1 with 2x RTX 2080 running an Estimator.
The failure seems to be in the NCCL all_reduce method (error message: no OpKernel registered for NCCL AllReduce).
I got it to run by switching from NCCL to hierarchical_copy, which meant using the contrib cross_device_ops methods as follows:
Failed command:
mirrored_strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0","/gpu:1"])
Successful command:
mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.contrib.distribute.AllReduceCrossDeviceOps(
        all_reduce_alg="hierarchical_copy"))
In newer TensorFlow versions, AllReduceCrossDeviceOps no longer exists. You can use tf.distribute.HierarchicalCopyAllReduce() instead:
mirrored_strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
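For completeness, a minimal TF 2.x usage sketch (my own example, not from the answer): build and compile the model inside the strategy scope so its variables are mirrored across the listed GPUs.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    devices=["/gpu:0", "/gpu:1"],
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())

with strategy.scope():
    # Variables created in this scope are replicated on both GPUs.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
    model.compile(loss="mse", optimizer="adam")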