How to train a model in Colab so that an internet interruption has no effect? - google-colaboratory

I am working on training a deep learning model. Training such a model in Google Colaboratory takes several hours, and I need to stay online for the whole time for the training to succeed. Is there any way to make Google Colab train the model so that an internet interruption does not disrupt the training? Otherwise I have to start training from scratch.
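One common workaround is to write checkpoints to Google Drive during training, so that after a disconnect you can reconnect and resume from the last checkpoint instead of starting over. A minimal Keras sketch, assuming a Keras training loop; the checkpoint folder, build_model and train_ds are placeholders, not from the original question:

import os
import tensorflow as tf
from google.colab import drive

# Mount Google Drive so checkpoints survive a Colab disconnect.
drive.mount('/content/drive')
ckpt_dir = '/content/drive/MyDrive/checkpoints'  # hypothetical folder
os.makedirs(ckpt_dir, exist_ok=True)

# Write a checkpoint at the end of every epoch.
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=os.path.join(ckpt_dir, 'epoch-{epoch:02d}.h5'),
    save_weights_only=False)

# After a disconnect, reconnect, load the newest checkpoint and keep going.
saved = sorted(os.listdir(ckpt_dir))
if saved:
    model = tf.keras.models.load_model(os.path.join(ckpt_dir, saved[-1]))
else:
    model = build_model()  # placeholder for your own model-building code

model.fit(train_ds, epochs=50, callbacks=[ckpt_cb])

This way at most one epoch of work is lost when the runtime disconnects.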

Related

Google colab stops immediately after training yolov3-tiny

I'm currently trying to train tiny yolo weights.
I've already trained normal yolov3 weights, but I want to make a live detector on a Raspberry Pi, so I need the tiny ones.
The training of the normal ones went great, no hiccups whatsoever, but the tiny weights just won't work.
I've tried like 4 different tutorials, but the outcome is the same every time.
Google Colab just stops.
I also tried training the normal weights again as a test, but there it immediately stops as well.
Adding -clear 1 after the command doesn't work, and I've tried modifying the cfg in different ways, but nothing. I don't know what to do anymore. Does anyone have an idea or a tip? That would be great.

Neural Network Memory

Using Google Colab, I made a neural network with TensorFlow that generates text based on examples. I ran 60 epochs. How can I get my neural network to keep what it has learned? Whenever I re-run it, it starts over.
Try saving your model at the end of the training like this:
import tensorflow as tf
tf.keras.models.save_model(model, 'model/my_model')
Then you can load the model like this:
model = tf.keras.models.load_model('model/my_model')
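If you then want to keep improving the same network across sessions rather than just reusing it, a minimal follow-up sketch (train_data is a placeholder for whatever dataset the original 60 epochs used, not something from the original answer) is to load the saved model and call fit again, so training continues from the learned weights instead of starting over:

import tensorflow as tf

# Reload the previously saved model (architecture and weights included).
model = tf.keras.models.load_model('model/my_model')

# Continue training from what was already learned.
model.fit(train_data, epochs=10)

# Save again so the extra learning is available in the next session.
tf.keras.models.save_model(model, 'model/my_model')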

When should I stop the object detection model training while the mAP is not stable?

I am re-training the SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and evaluating against 100 images from that dataset.
The problem is that after about 24 hours of training, the TotalLoss seems unable to go below 2.0.
And the corresponding mAP score is quite unstable.
In fact, I have actually tried to train for about 48 hours, and the TotalLoss just cannot go below 2.0, hovering somewhere around 2.5~3.0. And during that time, the mAP is even lower.
So here is my question: given my situation (I really don't need any "high-precision" model; as you can see, I picked 900 images for training and would simply like to do a PoC of model training/prediction and that's it), when should I stop the training and obtain a reasonably performing model?
Indeed, for detection you need to fine-tune the network. Since you are using SSD, there are already some resources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (this one is specifically for an SSD model; it uses MXNet, but you can do the same with TF)
You can watch a very nice fine-tuning intro here
This repo has a nice fine-tuning option enabled as long as you write your dataloader; check it out here
In general your error can be attributed to many factors: the learning rate you are using, and the characteristics of the images themselves (are they normalized?). If the SSD network you are using was trained with normalized data and you don't normalize when retraining, then you'll get stuck while learning. Also, what learning rate are you using?
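To make the normalization point concrete, here is a minimal TensorFlow sketch; the preprocess function, the 300x300 input size and the [-1, 1] range are assumptions for illustration, not taken from the asker's pipeline:

import tensorflow as tf

def preprocess(image, boxes):
    # Resize to the fixed input size the pretrained SSD expects
    # (300x300 is an assumption; match your checkpoint's config).
    image = tf.image.resize(image, (300, 300))
    # Scale pixels from [0, 255] to [-1, 1], the usual convention for
    # Inception/MobileNet-style backbones.
    image = tf.cast(image, tf.float32) / 127.5 - 1.0
    return image, boxes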
From the model zoo I can see that for SSD there are models trained on COCO, and models trained on Open Images.
If, for example, you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration; also make sure the model's input sizes are the same as the ones you provide.
You can get very good detections even with little data if you also include many augmentations and take into account the rest of the things I mentioned. More details on your code would help to see where the problem lies.
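As a small illustration of the augmentation point (only a sketch with assumed names, not the Object Detection API's built-in augmentation options in the .config file), a few tf.image transforms can effectively multiply a 900-image training set:

import tensorflow as tf

def augment(image, boxes):
    # Photometric augmentations only, so the bounding boxes stay valid.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    return image, boxes

# Hypothetical tf.data pipeline applying the augmentation on the fly:
# train_ds = train_ds.map(augment).shuffle(512).batch(16)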

Tensorflow Object Detection API - What is test.record actually used for?

I have a few doubts about the Tensorflow Object Detection API. Hopefully someone can help me out... Before that, I need to mention that I am following what sendex is doing, so basically the steps come from him.
First doubt: Why do we need test.record for training? What does it do during training?
Second doubt: Sendex gets images from test.record to test the newly trained model; doesn't the model already know those images, because they are from test.record?
Third doubt: In what kind of situation do we need to activate drop_out (in the .config file)?
1) It does nothing during training; you don't need it to train. But at a certain point the model begins to overfit: the loss on the training images keeps going down while the accuracy on the test images stops improving and begins to decline. That is the time to stop training, and to recognise this moment you need test.record.
2) Those images are used only to evaluate the model during training, not to train the net.
3) You do not have to activate it, but with dropout you usually achieve higher accuracy, since it prevents the net from overfitting. Points 1) and 3) are illustrated in the sketch below.
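Both points are easiest to see in plain Keras rather than in the Object Detection API's .config file; this is only an illustrative sketch with placeholder data, not the asker's setup:

import tensorflow as tf

# Dropout randomly zeroes activations during training, which discourages
# the net from memorising the training images (point 3).
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# EarlyStopping watches the held-out data (the role test.record plays in
# the Object Detection API) and stops once its loss stops improving (point 1).
stop_cb = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3,
                                           restore_best_weights=True)

# x_train/y_train and x_val/y_val are placeholders for your own data:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[stop_cb])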

Tensorflow Retrain the retrained model

I am very new to neural networks and TensorFlow, just starting on the image retraining tutorial. I have successfully completed the flower_photos training and I have 2 questions.
1.) Is it a good/bad idea to keep building upon a retrained model many times, over and over? Or would it be a lot better to train a model fresh every time? That leads to my second question.
2.) If it is OK to retrain a model over and over, for the retrain model tutorial in Tensorflow (Image_retraining), in retrain.py would I simply replace classify_image_graph_def.pb and imagenet_synset_to_human_label_map.txt with the ones output from my retraining? But I see there is also an imagenet_2012_challenge_label_map_proto.pbtxt; would I have to replace that one with something else?
Thanks for your time