Training SSD-MobileNet V1 and the loss does not decrease - tensorflow

I'm new to everything about CNNs and TensorFlow. I'm fine-tuning a pretrained model with ssd_mobilenet_v1_pets.config to detect columns of buildings. Training has run for about one day, but the loss stays between 1 and 2 and has not decreased for the last 10 hours.
I realized that my input images are 128x128 and SSD resizes the images to 300x300.
Does the size of the input images affect the training?
If that is the case, should I retrain the network with larger input images? Or what would be another option to decrease the loss? My training dataset has 660 images and my test dataset has 166. I don't know if that is enough images.
I really appreciate your help.

Loss values of ssd_mobilenet can be different from faster_rcnn. From EdjeElectronics' TensorFlow Object Detection Tutorial:
For my training on the Faster-RCNN-Inception-V2 model, it started at
about 3.0 and quickly dropped below 0.8. I recommend allowing your
model to train until the loss consistently drops below 0.05, which
will take about 40,000 steps, or about 2 hours (depending on how
powerful your CPU and GPU are). Note: The loss numbers will be
different if a different model is used. MobileNet-SSD starts with a
loss of about 20, and should be trained until the loss is consistently
under 2.
For more information: https://github.com/EdjeElectronics/TensorFlow-Object-Detection-API-Tutorial-Train-Multiple-Objects-Windows-10#6-run-the-training
The SSD Mobilnet architecture demands additional training to suffice
the loss accuracy values of the R-CNN model, however, offers
practicality, scalability, and easy accessibility on smaller devices
which reveals the SSD model as a promising candidate for further
assessment (Fleury and Fleury, 2018).
For more information: Fleury, D. & Fleury, A. (2018). Implementation of Regional-CNN and SSD machine learning object detection architectures for the real time analysis of blood borne pathogens in dark field microscopy. MDPI AG.

I would recommend setting aside 15%-20% of the images for testing, chosen so that they cover all the variety present in the training data. You said you have 650+ images for training and 150+ for testing. That puts roughly 20% of all images in the test set (about 25% of the training set size). It looks like you have enough images to start with. I know the more, the merrier, but make sure your model also has sufficient data to learn from!
Resizing the images does not contribute to the loss. It makes sure there is consistency across all images for the model to recognize them without bias. The loss has nothing to do with image resizing as long as every image is resized identically.
You have to stop and resume from checkpoints again and again if you want your model to fit well. Usually, you can get good accuracy by retraining SSD MobileNet until the loss is consistently under 1. Ideally we want the loss to be as low as possible, but we also want to make sure the model is not overfitting. It is all about trial and error. (A loss between 0.5 and 1 seems to do the job well, but again it all depends on you.)
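One way to make "consistently under 1" concrete is to check the last few logged loss values rather than a single reading. The sketch below is only an illustration: the function name, the window size, and the loss values are assumptions, not part of any TensorFlow API.

```python
def consistently_below(losses, threshold, window):
    """Return True if each of the last `window` loss values is below `threshold`."""
    if len(losses) < window:
        return False
    return all(value < threshold for value in losses[-window:])


# Hypothetical loss readings, e.g. one value logged every few hundred steps.
loss_log = [2.1, 1.6, 1.2, 0.95, 1.05, 0.9, 0.85, 0.8]

print(consistently_below(loss_log, threshold=1.0, window=3))  # True
print(consistently_below(loss_log, threshold=1.0, window=5))  # False (1.05 is in the window)
```

A larger window is stricter, so a single lucky low reading does not count as convergence.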
I think your model is underperforming because your test data contains more variety than your training data covers.
The training data has not given the model enough examples to handle the new variety in the test data. (For example: your test data has some images of new angles of buildings which are not sufficiently present in the training data.) In that case, I recommend putting the full variety of images in the training data first and then picking the test images, making sure you still have sufficient training data for the new poses. That's why I recommend using 15%-20% of the data for testing.
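The split described above can be sketched as a per-group split, so that every kind of image appears in both sets. This is a minimal illustration under assumptions: the "view" tag per image, the file names, and the helper name `stratified_split` are all hypothetical, and you would need some such grouping label for your own images.

```python
import random
from collections import defaultdict


def stratified_split(items, group_of, test_fraction=0.2, seed=0):
    """Split items so each group sends about `test_fraction` of its members to the test set."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    rng = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        # At least one test item per group, so no variety is missing from the test set.
        n_test = max(1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


# Hypothetical dataset: 826 image names, each tagged with a made-up "view" label.
views = ["front", "side", "oblique"]
images = [(f"img_{i:04d}.jpg", views[i % 3]) for i in range(826)]

train, test = stratified_split(images, group_of=lambda item: item[1])
print(len(train), len(test))
```

Because the split is done inside each group, a rare view cannot end up only in the test set unless the group has a single image.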

Related

How to improve the performance of CNN Model for a specific Dataset? Getting Low Accuracy on both training and Testing Dataset

We were given an assignment in which we were supposed to implement our own neural network and two other already developed neural networks. I have done that. Although it isn't a requirement of the assignment, I would still like to know what steps I can follow to improve the accuracy of my models.
I am fairly new to deep learning and machine learning as a whole, so I do not have much idea.
The given dataset contains a total of 15 classes (airplane, chair etc.) and we are provided with about 15 images of each class in training dataset. The testing dataset has 10 images of each class.
Complete github repository of my code can be found here (Jupyter Notebook file): https://github.com/hassanashas/Deep-Learning-Models
I tried it out with my own CNN first (made one using YouTube tutorials).
The code is as follows:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense

X_train = X_train/255.0  # scale pixel values to [0, 1]
model = Sequential()
model.add(Conv2D(64, (3, 3), input_shape = X_train.shape[1:]))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3)))
model.add(Activation("relu"))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(64))
model.add(Dense(16)) # added 16 because model.fit gave an error with 15 (labels run from 1 to 15)
model.add(Activation('softmax'))
For compiling the model:
from tensorflow.keras.optimizers import SGD
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=SGD(learning_rate=0.01),
              metrics=['accuracy'])
I used sparse categorical crossentropy because my "y" labels were integer values ranging from 1 to 15.
I ran the model as follows:
model_fit = model.fit(X_train, y_train, batch_size=32, epochs=30, validation_split=0.1)
It gave me an accuracy of 0.2030 on the training dataset and only 0.0733 on the testing dataset (both datasets are present in the GitHub repository).
Then, I tried out the AlexNet CNN (followed a YouTube tutorial for its code).
I ran AlexNet on the same dataset for 15 epochs. It improved the accuracy on the training dataset to 0.3317; however, accuracy on the testing dataset was even worse than my own CNN, at only 0.06.
Afterwards, I tried out the VGG16 CNN, again following a YouTube tutorial.
I ran the code on Google Colab for 10 epochs. It reached 100% accuracy on the training dataset in the 8th epoch. But this model gave the worst accuracy of all three on the testing dataset, with only 0.0533.
I am unable to understand this contrasting behavior of all these models. I have tried out different epoch values, loss functions, etc., but the current ones gave the best results relatively. My own CNN was able to get to 100% accuracy when I ran it for 100 epochs (however, it gave very poor results on the testing dataset).
What can I do to improve the performance of these models? And specifically, what are the few crucial things that one should always try to follow in order to improve the accuracy of a deep learning model? I have looked up multiple similar questions on Stack Overflow, but almost all of them were working on datasets provided by TensorFlow, like the MNIST dataset, and I didn't find much help from those.
Disclaimer: it's been a few years since I've played with CNNs myself, so I can only pass on some general advice and suggestions.
First of all, I would like to talk about the results you've gotten so far. The first two networks you've trained seem to at least learn something from the training data because they perform better than just randomly guessing.
However: the performance on the test data indicates that the network has not learned anything meaningful because those numbers suggest the network is as good as (or only marginally better than) a random guess.
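To put the "random guess" comparison in numbers, here is a small calculation. It assumes the test set is balanced (10 images for each of 15 classes, as described in the question) and uses the binomial standard error as a rough yardstick; both are simplifications for illustration only.

```python
import math

num_classes = 15
test_images = num_classes * 10  # 10 test images per class, as stated in the question

# Expected accuracy of uniform random guessing on balanced classes.
chance = 1 / num_classes

# Rough standard error of an accuracy measured on this many images at chance level.
std_error = math.sqrt(chance * (1 - chance) / test_images)

reported_test_accuracy = {"own CNN": 0.0733, "AlexNet": 0.06, "VGG16": 0.0533}
for name, acc in reported_test_accuracy.items():
    print(f"{name}: {acc:.4f}, chance {chance:.4f}, difference {acc - chance:+.4f}")
print(f"standard error at chance level: {std_error:.4f}")
```

All three reported test accuracies lie within about one standard error of 1/15, which is what the answer means by "as good as a random guess".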
As for the third network: high accuracy for training data combined with low accuracy for testing data means that your network has overfitted. This means that the network has memorized the training data but has not learned any meaningful patterns.
There's no point in continuing to train a network that has started overfitting. So once the training accuracy increases and testing accuracy decreases for a few epochs consecutively, you can stop training.
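The stopping rule above can be sketched as a "patience" check. The example below is a plain-Python illustration with made-up accuracy values; it uses a held-out validation score as the signal, which is an assumption on my part (the answer speaks of testing accuracy). Keras offers a similar mechanism through its `EarlyStopping` callback, which this sketch does not try to reproduce exactly.

```python
def best_epoch_with_patience(val_scores, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once `patience` epochs have passed without a new best score.
    """
    best_epoch = 0
    best_score = float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch


# Hypothetical per-epoch validation accuracies: rising, then falling as overfitting sets in.
val_acc = [0.10, 0.14, 0.18, 0.17, 0.16, 0.15, 0.14]
stop_epoch, best_epoch = best_epoch_with_patience(val_acc, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

The weights worth keeping are those from `best_epoch`, not from the epoch where training stopped.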
Increase the dataset size
Neural networks rely on loads of good training data to learn patterns from. Your dataset contains 15 classes with 15 images each, which is very little training data.
Of course, it would be great if you could get hold of additional high-quality training data to expand your dataset, but that is not always feasible. So a different approach is to artificially expand your dataset. You can easily do this by applying a bunch of transformations to the original training data. Think about: mirroring, rotating, zooming, and cropping.
Remember to not just apply these transformations willy-nilly, they must make sense! For example, if you want a network to recognize a chair, do you also want it to recognize chairs that are upside down? Or for detecting road signs: mirroring them makes no sense because the text, numbers, and graphics will never appear mirrored in real life.
From the brief description of the classes you have (planes and chairs and whatnot...), I think mirroring horizontally could be the best transformation to apply initially. That will already double your training dataset size.
Also, keep in mind that an artificially inflated dataset is never as good as one of the same size that contains only authentic, real images. A mirrored image contains much of the same information as its original. We merely hope that it will delay overfitting and that the network will learn the important patterns instead.
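As a minimal sketch of the horizontal mirroring suggested above, the following plain-Python example treats an image as a list of pixel rows. The helper names and the tiny 2x3 "image" are hypothetical and only there to show the idea.

```python
def mirror_horizontally(image):
    """Flip an image, stored as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def augment_with_mirrors(images, labels):
    """Return the originals plus one mirrored copy of each image, with matching labels."""
    mirrored = [mirror_horizontally(img) for img in images]
    return images + mirrored, labels + labels


# Hypothetical 2x3 single-channel "image".
tiny = [[1, 2, 3],
        [4, 5, 6]]
images, labels = augment_with_mirrors([tiny], ["chair"])
print(len(images), labels)  # 2 ['chair', 'chair']
print(images[1])            # [[3, 2, 1], [6, 5, 4]]
```

With a NumPy array of shape (N, height, width, channels), the same flip can be written as `X_train[:, :, ::-1, :]`.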
Lower the learning rate
This is a bit of a side note, but try lowering the learning rate. Your network seems to overfit in only a few epochs, which is very fast. Lowering the learning rate will not prevent overfitting, but overfitting will set in more slowly. This means that you can hopefully find an epoch with better overall performance before overfitting takes place.
Note that a lower learning rate will never magically make a bad-performing network good. It's just one way to locate a set of parameters that performs a tad bit better.
Randomize the training data order
During training, the training data is presented in batches to the network. This often happens in a fixed order over all iterations. This may lead to certain biases in the network.
First of all, make sure that the training data is shuffled at least once. You do not want to present the classes one by one, for example first all plane images, then all chairs, etc... This could lead to the network unlearning much of the first class by the end of each epoch.
Also, reshuffle the training data between epochs. This will again avoid potential minor biases because of training data order.
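A minimal sketch of reshuffling between epochs, in plain Python: one index list is shuffled at the start of every epoch and used for both samples and labels, so the pairs stay aligned. The function name and the toy data are assumptions for illustration.

```python
import random


def epoch_batches(samples, labels, batch_size, epochs, seed=0):
    """Yield (epoch, batch_samples, batch_labels), reshuffling the order at every epoch."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    for epoch in range(epochs):
        rng.shuffle(order)  # new order each epoch; samples and labels share the same indices
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield epoch, [samples[i] for i in idx], [labels[i] for i in idx]


# Hypothetical data that is stored sorted by class, as in the "all planes, then all chairs" case.
samples = [f"plane_{i}" for i in range(4)] + [f"chair_{i}" for i in range(4)]
labels = ["plane"] * 4 + ["chair"] * 4

for epoch, xs, ys in epoch_batches(samples, labels, batch_size=4, epochs=2):
    print(epoch, ys)
```

In Keras, `model.fit` shuffles the training data before each epoch by default (`shuffle=True`). Note, however, that `validation_split` takes the last fraction of the data before any shuffling, so data stored sorted by class should be shuffled once before calling `fit`.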
Improve the network design
You've designed a convolutional neural network with only two convolution layers and two fully connected layers. Maybe this model is too shallow to learn to differentiate between the different classes.
Know that the convolution layers tend to first pick up small visual features and then tend to combine these in higher level patterns. So maybe adding a third convolution layer may help the network identify more meaningful patterns.
Obviously, network design is something you'll have to experiment with and making networks overly deep or complex is also a pitfall to watch out for!

Is there any way to resize input shapes in SNPE (dlc)?

I have trained a model based on TensorFlow. This model is supposed to work on a mobile phone, but I have a problem when converting the frozen graph (pb) to a deep learning container (dlc): I have to set the input size to a constant. As a result, the model can't work with arbitrary input sizes.
I am trying to find a way to resize the input shapes of a DLC model without re-initializing the model with "snpe-tensorflow-to-dlc --input_dims 1,512,512,3", because that approach is time-consuming.
In short, I want to resize the input shapes in a dlc model. Can anybody help me?
Deployment solutions usually work with fixed input shapes because they assume a widely used usage model: resize all pictures to the same fixed size and run inference. Because of this usage model, developers of deployment solutions usually prioritize inference time over model loading time. The same holds for SNPE, OpenVINO, TFLite, etc.
To illustrate the times, here are some results from a Snapdragon 820. Loading Inception v3 to the CPU takes 715 ms, and loading it to the DSP takes 3 seconds. Inference on the CPU takes 1 second, and inference on the DSP takes 100 ms. Loading takes longer on the DSP than on the CPU, but inference is about 10 times faster.
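Using only the four timings quoted above, a short calculation shows after how many inferences the slower DSP load pays for itself. The variable and function names are mine, and the model assumes one load followed by n inferences with constant per-inference time, which is a simplification.

```python
import math

# Timings quoted above for Inception v3 on a Snapdragon 820, in milliseconds.
load_ms = {"CPU": 715, "DSP": 3000}
infer_ms = {"CPU": 1000, "DSP": 100}


def total_ms(target, n):
    """Total time for one model load followed by n inferences."""
    return load_ms[target] + n * infer_ms[target]


# Solve 715 + 1000 n = 3000 + 100 n for n.
break_even = (load_ms["DSP"] - load_ms["CPU"]) / (infer_ms["CPU"] - infer_ms["DSP"])
first_n_dsp_wins = math.floor(break_even) + 1

print(round(break_even, 2))   # 2.54
print(first_n_dsp_wins)       # 3
print(total_ms("CPU", 3), total_ms("DSP", 3))  # 3715 3300
```

So from the third inference onward the DSP is ahead in total time, which is why load time is rarely the priority when the model is loaded once and used many times.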
At the same time, it is usually possible to change the shape before loading the model, assuming that all input pictures will have a size different from the one the model was trained on (but, again, the same size for all pictures). For SNPE, this is SNPEBuilder::setInputDimensions.
If the model allows reshaping and there are no bugs in the SNPE implementation, the model can be reshaped and loaded.
I am not sure whether your usage model fits the one described in the first paragraph. Also, to benefit from different input sizes you would need to develop a special topology that is unlikely to be supported by SNPE. If you take a regular SSD, reshape it to a different size, and measure accuracy on a validation set, you will most likely get the best result on the shapes the model was trained on.

When should I stop the object detection model training while mAP are not stable?

I am re-training SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and evaluating on 100 images from that dataset.
The problem is that after about 24 hours of training, the TotalLoss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have tried training for about 48 hours, and the TotalLoss still cannot go below 2.0; it ranges from about 2.5 to 3.0. During that time, the mAP is even lower.
So here is my question: given my situation (I really don't need a "high-precision" model; as you can see, I picked 900 images for training and would simply like to do a PoC model training/prediction and that's it), when should I stop the training and obtain a reasonably performing model?
Indeed, for detection you need to fine-tune the network. Since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one specifically for an SSD Model, uses mxnet but you can use the same with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general, your error can be attributed to many factors: the learning rate you are using, or the characteristics of the images themselves (are they normalized?). If the SSD network you are using was trained with normalized data and you don't normalize when retraining, you'll get stuck while learning. Also, what learning rate are they using?
From the model zoo I can see that for SSD there are models trained on COCO
And models trained on Open Images:
If, for example, you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration. Also make sure the input sizes are the same as the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the rest of the things I mentioned. More details on your code would help to see where the problem lies.

Tensorflow object detection: why is the location in image affecting detection accuracy when using ssd mobilenet v1?

I'm training a model to detect meteors within a picture of the night sky and I have a fairly small dataset with about 85 images, and each image is annotated with a bounding box. I'm using the transfer learning technique starting with the ssd_mobilenet_v1_coco_11_06_2017 checkpoint and Tensorflow 1.4. I'm resizing images to 600x600 pixels during training. I'm using data augmentation in the pipeline configuration to randomly flip the images horizontally, vertically and rotate 90 deg. After 5000 steps, the model converges to a loss of about 0.3 and will detect meteors, but it seems to matter where in the image the meteor is located. Do I have to train the model by giving examples of every possible location? I've attached a sample of a detection run where I tiled a meteor over the entire image and received various levels of detection (filtered to 50%). How can I improve this?
(Image: detected meteors in image example)
It could very well be your data and I think you are making a prudent move by improving the heterogeneity of your dataset, BUT it could also be your choice of model.
It is worth noting that ssd_mobilenet_v1_coco has the lowest COCO mAP relative to the other models in the TensorFlow Object Detection API model zoo. You aren't trying to detect a COCO object, but the mAP numbers are a reasonable approximation for generic model accuracy.
At the highest possible level, the choice of model is largely a tradeoff between speed and accuracy. The model you chose, ssd_mobilenet_v1_coco, favors speed over accuracy. Consequently, I would recommend you try one of the Faster RCNN models (e.g., faster_rcnn_inception_v2_coco) before you spend a significant amount of time preprocessing images.

Selecting tensorflow object detection API training hyper parameters

I am setting up an object detection pipeline based on the recently released TensorFlow Object Detection API. I am using the arXiv paper as guidance. I am looking to understand the points below for training on my own dataset.
1. It is not clear how they selected the learning rate schedules and how those would change based on the number of GPUs available for training. How does the learning rate schedule change based on the number of GPUs available for training? The paper mentions that 9 GPUs are used. How should I change the learning rate if I only want to use 1 GPU?
2. The released sample training config file for Pascal VOC using Faster R-CNN has initial learning rate = 0.0001. This is 10x lower than what was published in the original Faster-RCNN paper. Is this due to an assumption about the number of GPUs available for training or due to a different reason?
3. When I start training from the COCO detection checkpoint, how should the training loss decrease? Looking at TensorBoard, on my dataset the training loss is low: between 0.8 and 1.2 per iteration (with a batch size of 1). The image below shows the various losses from TensorBoard. Is this expected behavior?
For questions 1 and 2: our implementation differs in a few small details compared to the original paper and internally we train all of our detectors with asynchronous SGD with ~10 GPUs. Our learning rates are calibrated for this setting (which you will also have if you decide to train via Cloud ML Engine as in the Pets walkthrough). If you use another setting, you will have to do a bit of hyperparameter exploration. For a single GPU, leaving the learning rate alone probably won't hurt performance, but you may be able to get faster convergence by increasing it.
For question 3: Training losses decrease erratically and you can only see the decrease if you smooth the plots quite a bit over time. Moreover, it's hard to explicitly say how well you are doing with respect to eval metrics just by looking at the training losses. I recommend looking at the mAP plots over time as well as the image visualizations to really get an idea of whether your model has "lifted off".
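To illustrate the kind of smoothing meant in the answer to question 3, here is a minimal exponential moving average in plain Python. It is a sketch and not TensorBoard's exact implementation; the smoothing weight and the loss values (picked from the 0.8 to 1.2 range mentioned in the question) are assumptions.

```python
def exponential_moving_average(values, weight=0.9):
    """Smooth a sequence; a higher `weight` gives a smoother, slower-reacting curve."""
    if not values:
        return []
    smoothed = []
    running = values[0]
    for value in values:
        running = weight * running + (1 - weight) * value
        smoothed.append(running)
    return smoothed


# Hypothetical noisy per-iteration losses.
raw_losses = [1.2, 0.8, 1.1, 0.9, 1.15, 0.85, 1.0, 0.8]
smoothed = exponential_moving_average(raw_losses, weight=0.9)

for raw, smooth in zip(raw_losses, smoothed):
    print(f"{raw:.2f} -> {smooth:.3f}")
```

The raw values jump by up to 0.4 between steps, while the smoothed curve drifts slowly, which makes a gradual downward trend visible.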
Hope this helps.