I'm a user of Tensorflow Object Detection API. I use it a lot to train models like Faster-RCNN on my images. From what I understand, there is a backbone network (in my experiences, ResNet) used to extract features.
I would like to re-use the weight of this specific network, but when I save my model, it's a Faster RCNN model and, even after hours in the documentation and in the source files, I don't see how to isolate the weight of the backbone network.
Is something somebody else already has realized before ? Or is Tensorflow OD API not the right tool for what I need ?
Thank you for your help or advises !
Related
I'm looking into training an object detection network using Tensorflow, and I had a look at the TF2 Model Zoo. I noticed that there are noticeably less models there than in the directory /models/research/models/, including the MobileDet with SSDLite developed for the jetson xavier.
To clarify, the readme says that there is a MobileDet GPU with SSDLite, and that the model and checkpoints trained on COCO are provided, yet I couldn't find them anywhere in the repo.
How is one supposed to use those models?
I already have a custom-trained MobileDetv3 for image classification, and I was hoping to see a way to turn the network into an object detection network, in accordance with the MobileDetv3 paper. If this is not straightforward, training one network from scratch could be ok too, I just need to know where to even start from.
If you plan to use the object detection API, you can't use your existing model. You have to choose from a list of models here for v2 and here for v1
The documentation is very well maintained and the steps to train or validate or run inference (test) on custom data is very well explained here by the TensorFlow team. The link is meant for TensorFlow version v2. However, if you wish to use v1, the process is fairly similar and there are numerous blogs/videos explaining how to go about it
I am re-training the SSD MobileNet with 900 images from the Berkeley Deep Drive dataset, and eval towards 100 images from that dataset.
The problem is that after about 24 hours of training, the totalloss seems unable to go below 2.0:
And the corresponding mAP score is quite unstable:
In fact, I have actually tried to train for about 48 hours, and the TotoalLoss just cannot go below 2.0, something ranging from 2.5~3.0. And during that time, mAP is even lower..
So here is my question, given my situation (I really don't need any "high-precision" model, as you can see, I pick 900 images for training and would like to simply do a PoC model training/predication and that's it), when should I stop the training and obtain a reasonably performed model?
indeed for detection you need to finetune the network, since you are using SSD, there are already some sources out there:
https://gluon-cv.mxnet.io/build/examples_detection/finetune_detection.html (This one specifically for an SSD Model, uses mxnet but you can use the same with TF)
You can watch a very nice finetuning intro here
This repo has a nice fine tuning option enabled as long as you write your dataloader, check it out here
In general your error can be attributed to many factors, the learning rate you are using, the characteristics of the images themselves (are they normalized?) If the ssd network you are using was trained with normalized data and you don't normalize to retrain then you'll get stuck while learning. Also what learning rate are they using?
From the model zoo I can see that for SSD there are models trained on COCO
And models trained on Open Images:
If for example you are using ssd_inception_v2_coco, there is a truncated_normal_initializer in the input layers, so take that into consideration, also make sure the input sizes are the same that the ones you provide to the model.
You can get very good detections even with little data if you also include many augmentations and take into account the rest of the things I mentioned, more details on your code would help to see where the problem lies.
Open CV provides a simple API to detect and extract faces from given images. ( I do not think it works perfectly fine though because I experienced that it cuts frames from the input pictures that have nothing to do with face images. )
I wonder if tensorflow API can be used for face detection. I failed finding relevant information but hoping that maybe an experienced person in the field can guide me on this subject. Can tensorflow's object detection API be used for face detection as well in the same way as Open CV does? (I mean, you just call the API function and it gives you the face image from the given input image.)
You can, but some work is needed.
First, take a look at the object detection README. There are some useful articles you should follow. Specifically: (1) Configuring an object detection pipeline, (3) Preparing inputs and (3) Running locally. You should start with an existing architecture with a pre-trained model. Pretrained models can be found in Model Zoo, and their corresponding configuration files can be found here.
The most common pre-trained models in Model Zoo are on COCO dataset. Unfortunately this dataset doesn't contain face as a class (but does contain person).
Instead, you can start with a pre-trained model on Open Images, such as faster_rcnn_inception_resnet_v2_atrous_oid, which does contain face as a class.
Note that this model is larger and slower than common architectures used on COCO dataset, such as SSDLite over MobileNetV1/V2. This is because Open Images has a lot more classes than COCO, and therefore a well working model need to be much more expressive in order to be able to distinguish between the large amount of classes and localizing them correctly.
Since you only want face detection, you can try the following two options:
If you're okay with a slower model which will probably result in better performance, start with faster_rcnn_inception_resnet_v2_atrous_oid, and you can only slightly fine-tune the model on the single class of face.
If you want a faster model, you should probably start with something like SSDLite-MobileNetV2 pre-trained on COCO, but then fine-tune it on the class of face from a different dataset, such as your own or the face subset of Open Images.
Note that the fact that the pre-trained model isn't trained on faces doesn't mean you can't fine-tune it to be, but rather that it might take more fine-tuning than a pre-trained model which was pre-trained on faces as well.
just increase the shape of the input, I tried and it's work much better
Tensorflow object detection API provides a number of pretrained object detection models to choose from. However, I would like to introduce modifications to the architecture of those models.
Particularly, I would like to make Faster RCNN into a more shallow network and use it to train my model. I want to gain in performance despite loss in accuracy. MobileNet is too inaccurate for my application.
Is it possible to achieve this without having to implement everything from scratch ?
Thank you.
I have a question related to this one:
TensorFlow in production for real time predictions in high traffic app - how to use?
I want to setup TensorFlow Serving to do inference as a service for our other application. I see how TensorFlow Serving helps me to do that. Additionally, it mentions a continuous training pipeline, which probably is related to the possibility that TensorFlow Serving can serve with multiple versions of a trained model. But what I am not sure is how to retrain your model as you get new data. The other post mentions the idea to run retraining with cron jobs. However, I am not sure if automatic retraining is a good idea. What architecture would you propose for a continuous retraining pipeline with a system continuously facing new, labelled data?
Edit: It is a supervised learning case. The question is would you automatically retrain your model after n new datapoints came in or would you retrain during the downtime of the customer automatically or just retrain manually?
You probably want to use some kind of semi-supervised training. There's fairly extensive research in that area.
A crude, but expedient way, which works well, is to use the current best models that you have to label the new, incoming data. Models are typically able to produce a score (hopefully a logprob). You can use that score to only train on the data that fits well.
That is an approach that we have used in speech recognition and is an excellent baseline.