Encountering an error during DeepLab v3+ training on the Cityscapes semantic segmentation dataset - tensorflow

Hi all,
I started training DeepLab v3+ following this guide. However, after step 1480, I got the error:
Error reported to Coordinator: Nan in summary histogram for: image_pooling/BatchNorm/moving_variance_2
The detailed training log is here.
Could someone suggest how to solve this issue? Thanks!

Based on the log, it seems that you are training with batch_size = 1 and fine_tune_batch_norm = True (the default value). Since you are fine-tuning batch norm during training, it is better to set the batch size as large as possible (see the comments in train.py and Q5 in the FAQ). If only limited GPU memory is available, you could fine-tune from the provided pre-trained checkpoint, set a smaller learning rate, and use fine_tune_batch_norm = False (see model_zoo.md for details). Note: make sure the flag tf_initial_checkpoint has the correct path to the desired pre-trained checkpoint.
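For reference, the relaunch command would then look something along these lines (a rough sketch only: the paths are placeholders, the learning-rate value is just an example of "smaller", and the model_variant/crop-size flags from the guide are omitted for brevity):

python deeplab/train.py \
    --logtostderr \
    --train_batch_size=1 \
    --fine_tune_batch_norm=false \
    --base_learning_rate=0.0001 \
    --tf_initial_checkpoint=/path/to/pretrained/model.ckpt \
    --train_logdir=/path/to/train_logdir \
    --dataset="cityscapes" \
    --dataset_dir=/path/to/cityscapes/tfrecord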

Related

Evaluation loss in tensorboard is always 0 when training music_vae

I'm training my own music_vae models and I've noticed that the evaluation loss and accuracy are always 0 when viewed in TensorBoard. This is strange, because I followed a similar process to train the RNNs, and their evaluation loss and accuracy look good.
Here is what I'm seeing in TensorBoard (screenshots: legend, evaluation loss, accuracy).
Finally, here is what I'm seeing inside the eval folder; as you can see, there is some data there (screenshot: eval folder contents).
Any help on this issue would be appreciated! Thanks

Different results in TfLite model vs model before quantization

I took an object detection model from the TF Model Zoo v2 (MobileNet) and trained it on my own TFRecords.
I am using MobileNet because it often appears in examples of converting to TFLite, which is what I need because I run the model on a Raspberry Pi 3.
I am following ideas from the official example in the SageMaker docs and the GitHub repo you can find here.
What is interesting is that the accuracy after step 2 (training) and step 3 (deploying) is pretty good! My trucks are detected nicely by the custom-trained model.
However, when the model is converted to TFLite, the accuracy drops, no matter whether I use the tflite_convert tool or the Python tf.lite.TFLiteConverter.
What is more, all detections land on the borders of the images, usually in the bottom-right corner. Maybe I am not preparing the images correctly? Or am I misinterpreting the results?
You can check images I uploaded.
https://ibb.co/fSzfZvz
https://ibb.co/0GF101s
What could possibly go wrong?
I was lacking proper preprocessing of the image.
I used the pipeline config to build a detection model object, whose preprocess function I then used to build the input tensor before feeding it into the Interpreter:
from object_detection.utils import config_util
from object_detection.builders import model_builder

num_classes = 2

# Load the training pipeline config and rebuild the detection model from it.
configs = config_util.get_configs_from_pipeline_file(pipeline_config)
model_config = configs['model']
model_config.ssd.num_classes = num_classes
model_config.ssd.freeze_batchnorm = True
detection_model = model_builder.build(
    model_config=model_config, is_training=True)
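For completeness, here is a rough sketch of how the rebuilt detection_model's preprocess function can be fed into the TFLite Interpreter (the file names below are placeholders, and the output ordering depends on how the model was exported):

import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='model.tflite')  # placeholder path
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Raw float32 pixel values; preprocess() applies the same resizing and
# normalization that were used during training.
image = tf.cast(tf.io.decode_jpeg(tf.io.read_file('truck.jpg'), channels=3), tf.float32)
preprocessed, _ = detection_model.preprocess(image[tf.newaxis, ...])

interpreter.set_tensor(input_details[0]['index'], preprocessed.numpy())
interpreter.invoke()
boxes = interpreter.get_tensor(output_details[0]['index'])  # which output is which depends on the export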

Set batch size of trained keras model to 1

I have a Keras model trained on my own dataset. However, after loading the weights, the summary shows None as the first dimension (the batch size).
I want to know how to fix that shape to a batch size of 1, since I need a fixed batch size to convert the model to TFLite with GPU support.
What worked for me was to specify the batch size on the Input layer, like this:
input = layers.Input(shape=input_shape, batch_size=1, dtype='float32', name='images')
This then carried through the rest of the layers.
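For context, a fuller sketch of the same idea (the toy architecture, the build_model helper, and the file names are placeholders, not the original model): rebuild the network with the batch dimension pinned to 1, copy the trained weights across, then convert.

import tensorflow as tf
from tensorflow.keras import layers

def build_model(input_shape, batch_size=None):
    # Same architecture both times; only the batch dimension differs.
    inputs = layers.Input(shape=input_shape, batch_size=batch_size,
                          dtype='float32', name='images')
    x = layers.Conv2D(16, 3, activation='relu')(inputs)
    x = layers.GlobalAveragePooling2D()(x)
    outputs = layers.Dense(10, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)

trained = build_model((224, 224, 3))            # batch dimension is None
trained.load_weights('my_weights.h5')           # placeholder path

fixed = build_model((224, 224, 3), batch_size=1)
fixed.set_weights(trained.get_weights())        # identical layers, so the weights line up

converter = tf.lite.TFLiteConverter.from_keras_model(fixed)
with open('model_batch1.tflite', 'wb') as f:
    f.write(converter.convert())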
The bad news is that despite this "fix" the TFLite runtime still complains about dynamic tensors. I get these non-fatal errors in logcat when it runs:
E/tflite: third_party/tensorflow/lite/core/subgraph.cc:801 tensor.data.raw != nullptr was not true.
E/tflite: Attempting to use a delegate that only supports static-sized tensors with a graph that has dynamic-sized tensors (tensor#26 is a dynamic-sized tensor).
E/tflite: Ignoring failed application of the default TensorFlow Lite delegate indexed at 0.
The good news is that despite these errors it seems to be using the GPU anyway, based on performance testing.
I'm using:
tensorflow-lite-support:0.2.0
tensorflow-lite-metadata:0.2.1
tensorflow-lite:2.6.0
tensorflow:tensorflow-lite-gpu:2.3.0
Hopefully, they'll fix the runtime so it doesn't matter whether the batch size is 'None'. It shouldn't matter for doing inference.

when to stop training object detection tensorflow

I am training a Faster R-CNN model on a fruit dataset, using a pretrained model provided in the Google API (faster_rcnn_inception_resnet_v2_atrous_coco).
I made a few changes to the default configuration (number of classes: 12, fine_tune_checkpoint: path to the pretrained checkpoint, and from_detection_checkpoint: true). The total number of annotated images I have is around 12000.
After training for 9000 steps, the results I got have an accuracy below 1%, though I was expecting at least 50% (in evaluation, nothing is getting detected, as accuracy is almost 0). The loss fluctuates between 0 and 4.
How many steps should I train it for? I read an article that says to run around 800k steps, but is that the number of steps for training from scratch?
The FC layers of the model are changed because of the different number of classes, but that should not affect the classes that are already present in the pre-trained model, like 'apple', right?
Any help would be much appreciated!
You shouldn't look at your training loss to determine when to stop. Instead, you should run your model through the evaluator periodically, and stop training when the evaluation mAP stops improving.
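If it helps, one way to watch that metric without opening TensorBoard is to parse the evaluator's event files directly. This is a sketch under assumptions: the eval directory is a placeholder, and the summary tag name depends on the metric set configured in your pipeline (e.g. 'DetectionBoxes_Precision/mAP' for COCO metrics).

import tensorflow as tf

def map_by_step(eval_dir, tag='DetectionBoxes_Precision/mAP'):
    """Collect (step, mAP) pairs from the event files the evaluator writes."""
    points = []
    for event_file in tf.io.gfile.glob(eval_dir + '/events.out.tfevents.*'):
        for event in tf.compat.v1.train.summary_iterator(event_file):
            for value in event.summary.value:
                if value.tag == tag:
                    points.append((event.step, value.simple_value))
    return sorted(points)

# Stop training once the most recent evaluations no longer beat the best mAP so far.
for step, m in map_by_step('/path/to/eval_dir'):
    print(step, m)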

Tensorflow object-detection package error when change batch size

I am trying the newly released object-detection API for tensorflow.
I used the sample training program from the tutorial, i.e., fine-tuning the FRCNN-ResNet model on the pet dataset. Using only one GPU, an error message always shows up when I change the batch size to a value greater than 1 (the default is 1). The error message looks like this:
InvalidArgumentError (see above for traceback): ConcatOp : Dimensions of inputs should match: shape[0] = [1,750,600,3] vs. shape[1] = [1,600,804,3]
You are likely using the keep_aspect_ratio_resizer, which allows each image to be a different size, so you can only train with batch size 1. To train with larger batch sizes, the only way to handle this in the API currently is to use the fixed_shape_resizer. See some of the SSD configs for examples of this.
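For illustration, the resizer block in those SSD configs looks roughly like this (the 300x300 size is just an example; use the dimensions appropriate for your model):

image_resizer {
  fixed_shape_resizer {
    height: 300
    width: 300
  }
}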