Google Colab stops immediately after training yolov3-tiny - tensorflow

I'm currently trying to train tiny YOLO weights.
I've already trained normal yolov3 weights, but I want to build a live detector on a Raspberry Pi, so I need the tiny ones.
Training the normal weights went great, no hiccups whatsoever, but the tiny weights just won't work.
I've tried about four different tutorials, but the outcome is the same every time:
Google Colab just stops.
I also tried training the normal weights again as a test, but there it also stops immediately.
Adding -clear 1 after the command doesn't work, and I've tried modifying the cfg in different ways, but nothing helps. I don't know what to do anymore. Does anyone have an idea or a tip? That would be great.

Related

The model mistakes everything it knew from its pre-trained state for my custom object

I've followed an object detection tutorial from pythonprogramming.net to recognize a small robot (my custom object) based on the ssd_mobilenet_v1_coco model.
I have about 450 labelled images of my robot.
I used the official sample config for ssd_mobilenet_v1_coco and only made the necessary changes, like setting num_classes to 1 and reducing the batch size to 7, then trained until I had a loss that was consistently between 1 and 2 (about 10000 epochs).
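For reference, the relevant edits in the sample config would look roughly like this; an illustrative excerpt with most fields omitted:

model {
  ssd {
    num_classes: 1    # only one custom class: the robot
    ...
  }
}
train_config: {
  batch_size: 7
  ...
}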
The problem is, the model detects everything it used to know from its pre-trained state as my small robot, so it identifies all of those objects as being a robot even though they aren't.
I faced this issue before and fixed it by adding images that contain the pre-trained objects as negative examples. Another way to fix it is to train longer. If you do both, that should fix the problem, I think. And try increasing your dataset, by the way (I was training with 6000 images).
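If it helps, here is a minimal sketch of how a background (negative) example can be written to a TFRecord for the Object Detection API: a tf.train.Example that carries the image but empty box and class lists. The helper name and file paths are mine, not from the tutorial:

import tensorflow as tf
from PIL import Image

def make_negative_example(image_path):
    # Read the raw JPEG bytes and the image size; no boxes or classes
    # are attached, so the image acts as a pure background example.
    with tf.gfile.GFile(image_path, 'rb') as f:
        encoded_jpg = f.read()
    width, height = Image.open(image_path).size
    feature = {
        'image/encoded': tf.train.Feature(bytes_list=tf.train.BytesList(value=[encoded_jpg])),
        'image/format': tf.train.Feature(bytes_list=tf.train.BytesList(value=[b'jpeg'])),
        'image/height': tf.train.Feature(int64_list=tf.train.Int64List(value=[height])),
        'image/width': tf.train.Feature(int64_list=tf.train.Int64List(value=[width])),
        # Empty lists: this image contains no labelled objects.
        'image/object/bbox/xmin': tf.train.Feature(float_list=tf.train.FloatList(value=[])),
        'image/object/bbox/xmax': tf.train.Feature(float_list=tf.train.FloatList(value=[])),
        'image/object/bbox/ymin': tf.train.Feature(float_list=tf.train.FloatList(value=[])),
        'image/object/bbox/ymax': tf.train.Feature(float_list=tf.train.FloatList(value=[])),
        'image/object/class/label': tf.train.Feature(int64_list=tf.train.Int64List(value=[])),
        'image/object/class/text': tf.train.Feature(bytes_list=tf.train.BytesList(value=[])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

writer = tf.python_io.TFRecordWriter('negatives.record')  # output path is a placeholder
writer.write(make_negative_example('background_001.jpg').SerializeToString())
writer.close()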

Tensorflow Object Detection API w/ TPU Training - Display more granular Tensorboard plots

I've been following this tutorial on the Tensorflow Object Detection API, and I've successfully trained my own object detection model using Google's Cloud TPUs.
However, the problem is that on Tensorboard, the plots I'm seeing only have 2 data points each (so it just plots a straight line), like this:
...whereas I want to see more "granular" plots like these below, which are much more detailed:
The tutorial I've been following acknowledges that this issue is caused by the fact that TPU training requires very few steps to train:
Note that these graphs only have 2 points plotted since the model
trains quickly in very few steps (if you’ve used TensorBoard before
you may be used to seeing more of a curve here)
I tried adding save_checkpoints_steps=50 in the file model_tpu_main.py (see code fragment below), and when I re-ran training, I was able to get a more granular plot, with 1 data point every 300 steps or so.
config = tf.contrib.tpu.RunConfig(
    # I added this line below:
    save_checkpoints_steps=50,
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_shards))
However, my training job is actually saving a checkpoint every 100 steps, rather than every 300 steps. Looking at the logs, my evaluation job is running every 300 steps. Is there a way I can make my evaluation job run every 100 steps (whenever there's a new checkpoint) so that I can get more granular plots on Tensorboard?
Code which addresses this issue is explained by a technical lead for the Google Cloud Platform in a Medium blog post. Alternatively, go directly to the GitHub code.
The train_and_evaluate function of 81 lines defines a TPUEstimator, a train_input_fn and an eval_input_fn, then loops over the training steps, calling estimator.train and estimator.evaluate in each iteration. The metrics can be defined in the model_fn, which is called image_classifier. Note that adding tf.summary calls in the model function currently has no effect, since the TPU does not support them:
"TensorBoard summaries are a great way see inside your model. A minimal set of basic summaries are automatically recorded by the TPUEstimator, to event files in the model_dir. Custom summaries, however, are currently unsupported when training on a Cloud TPU. So while the TPUEstimator will still run locally with summaries, it will fail if used on a TPU." (source)
If summaries are important, it might be more convenient to switch to training on a GPU.
Personally, I think writing this code is quite a hassle for something that should be handled by the API. Please update this answer if better solutions exist! I'm looking forward to it.
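For illustration, a minimal sketch of that alternating train/evaluate pattern; this is a simplification with assumed names (model_fn, the input functions and the RunConfig come from elsewhere), not the actual 81-line GitHub function:

import tensorflow as tf

# Assumed to be defined elsewhere: model_fn (the image_classifier),
# train_input_fn, eval_input_fn, and a tf.contrib.tpu.RunConfig `config`.
estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=config,
    use_tpu=True,
    train_batch_size=1024,
    eval_batch_size=1024)

total_steps = 10000
eval_every = 100  # evaluate after every 100 training steps

for next_stop in range(eval_every, total_steps + 1, eval_every):
    # Train up to the next stopping point, then evaluate the freshly
    # written checkpoint so Tensorboard gets one more data point.
    estimator.train(input_fn=train_input_fn, max_steps=next_stop)
    metrics = estimator.evaluate(input_fn=eval_input_fn, steps=100)
    print('step %d: %s' % (next_stop, metrics))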
Set save_summary_steps in RunConfig to 100, so you get the statistics you want.
Also set iterations_per_loop to 100 so that the training doesn't run more steps than that per loop.
P.S. I hope you realize that checkpointing is very slow. You are probably raising the cost of your job just for the sake of a pretty graph :)
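Applied to the asker's fragment, the suggestion would look roughly like this:

config = tf.contrib.tpu.RunConfig(
    save_checkpoints_steps=50,
    save_summary_steps=100,      # the suggested change: summaries every 100 steps
    master=tpu_grpc_url,
    evaluation_master=tpu_grpc_url,
    model_dir=FLAGS.model_dir,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=100,  # keep the TPU loop aligned with 100 steps
        num_shards=FLAGS.num_shards))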
You can try adding throttle_secs=100 to the EvalSpec constructor here. The default is 600 seconds.
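For context, throttle_secs belongs to tf.estimator.EvalSpec, which is used with tf.estimator.train_and_evaluate; a minimal sketch (eval_input_fn is assumed to exist):

import tensorflow as tf

eval_spec = tf.estimator.EvalSpec(
    input_fn=eval_input_fn,
    steps=None,          # evaluate on the full evaluation set
    throttle_secs=100)   # re-evaluate at most every 100 seconds (default: 600)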

Tensorflow faster rcnn giving good detection but still detecting false positives with coco objects

I have used the tensorflow API to detect the Guinness harp using the process described here - https://pythonprogramming.net/introduction-use-tensorflow-object-detection-api-tutorial/.
I get mostly good results; whenever the logo is clear in the image, it finds it nicely.
However, after retraining from a coco checkpoint, it still detects what I think are coco objects (e.g. people, magazines) with a very high confidence rating. I cannot work out why this is.
(see below)
I am using the faster_rcnn_inception_v2_coco.config found here - https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/faster_rcnn_inception_v2_coco.config
Training for more steps does not seem to help, as the total loss averages out. The above screenshots were from 10,000 training steps. I am training on a CPU.
I am augmenting my training images using imgaug, and an example training image can be seen below (I have included the debug bounding box around the target).
However, if the training images were the problem, wouldn't the graph have trouble detecting the target altogether?
I had a similar issue recently; from what I saw, it looks like a case of underfitting. I tried multiple things to improve the results.
The thing that actually worked for me was augmenting the data using the imgaug library. You can augment the images as well as the bounding boxes using a simple script (see the sketch below); try to increase the dataset by, say, 10- to 12-fold.
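A minimal sketch of such a script using imgaug's bounding-box helpers; the file name and box coordinates are placeholders:

import imageio
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# A small pipeline of label-preserving augmentations.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),                 # horizontal flip half the time
    iaa.Affine(rotate=(-10, 10),     # slight rotation and scaling
               scale=(0.9, 1.1)),
    iaa.Multiply((0.8, 1.2)),        # brightness variation
])

image = imageio.imread('harp_001.jpg')  # placeholder filename
bbs = BoundingBoxesOnImage(
    [BoundingBox(x1=50, y1=40, x2=200, y2=180)],  # placeholder box
    shape=image.shape)

# imgaug transforms the image and its boxes consistently.
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)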
I would also suggest adding some background images, i.e. images with no objects; this was recommended by a few people in the tensorflow issues discussion.
Try training on the dataset again and monitor it using tensorboard. I think you will be able to reduce the number of false positives significantly.

Issue with custom object detection using tensorflow when training on a single type of object

I am training a pre-built tensorflow based model for custom object detection.
I want to detect only one type of object. I have taken a lot of images from different angles and in different light conditions, and I am training on a K80 Nvidia GPU. Everything is working, and when I train I can see the loss function falling to 0.3, but the loss drops very quickly to under 1 when I start training. I am using SSD MobileNet as the base configuration for the model. When I try to test the model, it just draws a big square on the input image rather than detecting the desired object in the image. Basically, it fails to detect the object.
I tried to train the model with a different set of images of mac n cheese, which had a lot of variation. Then the model worked fine and detected mac n cheese in the input image. But when I have pictures of a single object, the model fails to detect it. Please help me understand what I am doing wrong here.
The issue was with my training dataset: I was not properly cropping the object from the original image, and I needed around 300 images to properly train the model. SSD worked well after I provided well-cropped images.

Tensorflow Object Detection API - What is test.record actually used for?

I have a few doubts about the Tensorflow Object Detection API. Hopefully someone can help me out... Before that, I need to mention that I am following what sendex is doing, so basically the steps come from him.
First doubt: Why do we need test.record for training? What does it do during training?
Second doubt: Sendex is getting images from test.record to test the newly trained model; doesn't the model already know those images, since they come from test.record?
Third doubt: On what kind of occasion do we need to activate dropout (in the .config file)?
1) It does nothing during training; you don't need it to train. But at a certain point the model begins to overfit: the loss on the training images continues to go down, but the accuracy on the test images stops improving and begins to decline. That is the time to stop training, and to recognise this moment you need the test.record.
2) The images are used only to evaluate the model during training, not to train the net.
3) You do not need to activate it, but by using dropout you usually achieve higher accuracy, since it prevents the net from overfitting.
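For reference, in the SSD sample configs dropout is controlled in the box predictor section; an illustrative excerpt (surrounding fields omitted, the keep probability is an example value):

box_predictor {
  convolutional_box_predictor {
    use_dropout: true              # enable dropout in the box predictor
    dropout_keep_probability: 0.8  # keep 80% of activations during training
    ...
  }
}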