Is it possible to combine two different custom YOLOv4 models - object-detection

I'm working on an object detection project where I have to identify the type of animal and its posture given an image/video. For this purpose, I have two custom YOLOv4 models that were trained separately: Model 1 identifies the type of animal and Model 2 identifies its posture. I have converted both models to TensorFlow models.
Since both models take the same image/video as input, I want to combine their outputs so that the final output displays the bounding boxes from both models.
I'm stuck at this point. I have been researching a solution and I'm confused by the various methods. Could anyone help me with this?

I don't think you need an object detection model as the pose identifier, because the animal is already localized by the first network.
The easiest (and clearly not the most accurate) solution I see is to run a classifier on top of the detections (using the cropped bounding box as input). In that case the animal's anatomy is not taken into account explicitly, but I think that approach is still a good baseline.
For further experiments you can take a look at these and these solutions for animal pose estimation, but they are more complex to use.
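The two-stage idea above (detect first, then classify the crop) can be sketched roughly as follows. The `detect_animals` and `classify_posture` functions are hypothetical stand-ins for the two trained models; in practice they would call the TF-converted YOLOv4 detector and a posture classifier:

```python
import numpy as np

def crop_box(image, box):
    """Crop an (x1, y1, x2, y2) box from an HxWxC image array."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

# Hypothetical stand-ins for the two trained models.
def detect_animals(image):
    # Would run the animal-type detector; returns (box, label, score) tuples.
    return [((10, 20, 60, 90), "dog", 0.92)]

def classify_posture(crop):
    # Would run the posture classifier on the cropped detection.
    return "sitting"

def detect_with_posture(image):
    """Run the detector, then classify the posture of each cropped box."""
    results = []
    for box, animal, score in detect_animals(image):
        posture = classify_posture(crop_box(image, box))
        results.append({"box": box, "animal": animal,
                        "posture": posture, "score": score})
    return results

image = np.zeros((120, 160, 3), dtype=np.uint8)
print(detect_with_posture(image))
```

Each result then carries both labels, so a single drawing pass can render one box annotated with both the animal type and its posture.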

Related

CoreML - Image Classifier vs Object Detection

I was wondering, which would be better for the following:
I want to create a model to help distinguish car models; take the Mercedes C250 (2014) and Mercedes C63 (2014) as an example.
I understand object detection helps to identify multiple, well... objects in a given image. However, looking at a tutorial online and seeing how IBM Cloud lets you annotate specifics such as the badge on the car, certain detailing, etc., would object detection work better for me as opposed to just an image classifier?
I understand that the more data is fed in, the better the results, but in a general sense, what should the approach be? Image classifier or object detection? Or maybe something else? I've used and trained multiple image classifiers, but I am not at all happy with the results.
Any help or suggestions would be much appreciated.
Object detection is better, because a simple image classifier breaks if you have more than one car in a photo.

How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I, for example, download a version of MobileNet and use it to train on, let's say, red and green apples, does it forget all the things that it has already been trained on? And if so, why is it then beneficial to use MobileNet over building a CNN from scratch?
Thanks for any answers!
Does it forget all the things that it has already been trained on?
Yes, if you re-train a CNN previously trained on a large database with a new database containing fewer classes, it will "forget" the old classes. However, the old pre-training can help with learning the new classes; this training strategy is called "transfer learning" or "fine-tuning", depending on the exact approach.
As a rule of thumb, it is generally not a good idea to create a new network architecture from scratch, as better networks probably already exist. You may want to implement a custom architecture if:
You are learning CNNs and deep learning, or
You have a specific need and you have proven that other architectures won't fit or will perform poorly.
Usually, one takes an existing pre-trained network and specializes it for the specific task using transfer learning.
A lot of scientific literature is available for free online if you want to learn. You can start with the YOLO series and R-CNN, Fast R-CNN, and Faster R-CNN for detection networks.
The main concept behind object detection is that it divides the input image into a grid of N patches, and then, for each patch, it generates a set of sub-patches with different aspect ratios; let's say it generates M rectangular sub-patches. In total you need to classify M×N images.
The idea is then to analyze each sub-patch within each patch. You pass the sub-patch to the classifier in your model and, depending on the model's training, it will classify it as containing a green apple, a red apple, or nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the detected object.
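The grid-and-aspect-ratio generation step can be sketched as follows. This is a simplified, self-contained illustration (real detectors use learned anchors and multiple scales); the grid size and aspect ratios here are arbitrary example values:

```python
def generate_subpatches(img_w, img_h, grid=4, aspect_ratios=(0.5, 1.0, 2.0)):
    """Generate M x N candidate boxes: one box per aspect ratio (M) in
    each cell of a grid x grid layout (N cells), as (x1, y1, x2, y2)."""
    cell_w, cell_h = img_w / grid, img_h / grid
    boxes = []
    for gy in range(grid):
        for gx in range(grid):
            # Center of this grid cell.
            cx, cy = (gx + 0.5) * cell_w, (gy + 0.5) * cell_h
            for ar in aspect_ratios:
                # Same area as the cell, but stretched to the aspect ratio.
                w = cell_w * (ar ** 0.5)
                h = cell_h / (ar ** 0.5)
                boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

boxes = generate_subpatches(160, 160)
print(len(boxes))  # N=16 cells x M=3 ratios = 48 candidate boxes
```

Each of these candidate boxes would then be cropped and handed to the classifier, which is why increasing the grid size or the number of aspect ratios slows the model down.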
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (of course, the more sub-patches, the slower your model will be), and
The classifier. The classifier is normally an already existing network (MobileNet, VGG, ResNet, ...). This part is commonly called the "backbone" and it extracts the features of the input image. With the classifier you can either choose to train it from scratch, so the weights are adjusted to your specific problem, or you can load the weights from another known problem and use them in yours, so you won't need to spend time training them. In that case, they will also classify the objects the classifier was originally trained for.
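The "frozen backbone plus new head" idea can be illustrated with a toy NumPy sketch. The random projection stands in for pre-trained convolutional features (MobileNet/ResNet in practice); only the small logistic-regression head is trained on the new task, while the backbone weights never change:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen "backbone": a fixed random projection standing in
# for a pre-trained feature extractor. Never updated during fine-tuning.
W_backbone = rng.normal(size=(8, 16)) / np.sqrt(8)

def backbone_features(x):
    # ReLU features from the frozen backbone.
    return np.maximum(x @ W_backbone, 0.0)

def train_head(X, y, lr=0.5, steps=500):
    """Train only a new logistic-regression head on the frozen features."""
    w = np.zeros(16)
    F = backbone_features(X)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-F @ w))          # head predictions
        w -= lr * F.T @ (p - y) / len(y)          # update head weights only
    return w

X = rng.normal(size=(100, 8))
y = (X[:, 0] > 0).astype(float)                   # a new binary task
w = train_head(X, y)
acc = ((1.0 / (1.0 + np.exp(-backbone_features(X) @ w)) > 0.5) == y).mean()
print(acc)
```

The point is that the gradient updates touch only `w`; the backbone is reused as-is, which is why transfer learning saves so much training time compared to learning all the weights from scratch.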
Take a look at the Mask R-CNN implementation; I find the way they explain the process very interesting. In this architecture, you not only generate a bounding box but also segment the object of interest.

what neural network model is the most effective

At the moment I am studying neural networks. I tried different models to recognize people and came across one question that is very interesting to me. I used YOLOv3 and Mask R-CNN, but all of them missed people in photos taken from an indirect angle. Which of the existing models is the most accurate and effective?
This is the main problem with deep learning models: for every instance of an object you want to detect, there should be at least one similar object (in terms of angle, size, color, shape, etc.) in the training set. The more similar objects in the training data, the higher the probability of the object being detected.
In terms of speed and accuracy, YOLOv3 is currently one of the best. Mask R-CNN is also one of the best models if you want the exact boundaries of the object (segmentation). If there is no need for exact boundaries, I would recommend using YOLO for its efficiency. You can work on your training data and try to add multiple instances of people with different sizes, angles, and shapes, and also include cases of truncation and occlusion (when only part of a person is visible) to get better generalization in the model's performance.
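One cheap way to add viewpoint variety to the training data is geometric augmentation. A minimal sketch (real detection pipelines must also transform the bounding box annotations to match each image transform):

```python
import numpy as np

def augment(image):
    """Return simple geometric variants of an image array to add
    viewpoint variety. A sketch only: bounding boxes, if any, would
    need the same transforms applied to their coordinates."""
    return [image,
            image[:, ::-1],        # horizontal flip
            np.rot90(image, 1),    # 90-degree rotation
            np.rot90(image, 2)]    # 180-degree rotation

img = np.arange(12).reshape(3, 4)
print(len(augment(img)))  # 4 variants per source image
```

Augmentation helps with angle and orientation variety, but it cannot conjure genuinely new viewpoints (e.g. top-down shots), so collecting real images from the problematic angles is still the most reliable fix.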

Retrain TF object detection API to detect a specific car model -- How to prepare the training data?

I am new to object detection and trying to retrain object-detection API in TensorFlow to detect a specific car model in photos. When preparing my own training data to retrain the model, besides things like drawing bounding boxes, etc, my question is, should I also prepare negative examples in the training data (cars that are not the model I am interested in) to reach good performance?
I have read through some tutorials, and they usually give an example of detecting one type of object, with training data labelled only for that type. I was thinking, since the model first proposes some areas of interest and then tries to classify those areas, should I also prepare negative examples if I want to detect something very specific in photos?
I am retraining a faster_rcnn-based model. Thanks for the help.
Yes, you will also need negative examples for better performance. It seems like you are thinking about using transfer learning to train a pre-trained faster_rcnn model to add a new class for your custom car. You should start with an equal number of positive and negative examples (images with labelled bounding boxes). You will need examples of several negative classes (e.g. negative car type 1, negative car type 2, negative car type 3) in addition to your target car type.
You can look at examples of training data with one positive class and several negative classes for transfer learning in the data folder of my GitHub repo: PSV Detector Github
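Balancing the classes before training can be as simple as downsampling every class to the size of the smallest one. A sketch with hypothetical annotation records (in a real pipeline these would be the rows behind your TFRecord files):

```python
import random
from collections import Counter

# Hypothetical annotation records: (image_path, class_label).
annotations = (
    [("pos_%d.jpg" % i, "target_car") for i in range(40)] +
    [("neg1_%d.jpg" % i, "other_car_type_1") for i in range(100)] +
    [("neg2_%d.jpg" % i, "other_car_type_2") for i in range(80)]
)

def balance(annotations, seed=0):
    """Downsample each class to the size of the smallest one so the
    model sees positives and negatives in roughly equal numbers."""
    random.seed(seed)
    by_class = {}
    for rec in annotations:
        by_class.setdefault(rec[1], []).append(rec)
    n = min(len(recs) for recs in by_class.values())
    balanced = []
    for recs in by_class.values():
        balanced.extend(random.sample(recs, n))
    return balanced

balanced = balance(annotations)
print(Counter(label for _, label in balanced))
```

Downsampling is the simplest option; if throwing away data hurts, oversampling the rare class (with augmentation) is the usual alternative.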

How to create a class for non classified object in tensorflow?

Hi, I have built my CNN with two classes, dogs and cats. I have trained it and now I am able to classify dog and cat images. But what if I want to introduce a class for new, unclassified objects? For example, if I feed my network a flower image, the network gives me a wrong classification. I want to build my network with a third class for new, unclassified objects. But how can I build this third class? Which images do I have to use to get a class for new objects that are different from dogs and cats?
At the end of my network I use softmax, and my code is developed using TensorFlow. Could someone provide me with some suggestions? Thanks
You need to add a third "something else" class to your network. There are several ways to go about it. In general, if there is a class you want to detect, you should have examples of that class, so you could add images without cats or dogs to your training data, labelled with the new class. However, this is a bit tricky, because the new class is, by definition, everything in the universe except dogs and cats, so you cannot possibly expect to have enough data to train for it. In practice, though, if you have enough examples the network will probably learn that the third class is triggered whenever the first two are not.
Another option that I have used in the past is to model the "default" class slightly differently from the regular ones. So, instead of trying to actually learn what a "not cat or dog" image is, you can just explicitly say that it is whatever does not activate the cat or dog neurons. I did this by replacing the softmax in the last layer with sigmoids (so the loss becomes sigmoid cross-entropy instead of softmax cross-entropy, and the output is no longer a categorical probability distribution, but honestly this didn't make much difference performance-wise in my case), then expressing the "default" class as 1 minus the maximum activation value over every other class. So, if no class had an activation of 0.5 or greater (i.e. a 50% estimated probability of being that class), the "default" class would be the highest scoring one. You can explore this and other similar schemes.
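The sigmoid-based "default" scheme described above can be sketched in a few lines. The logits and class names are hypothetical example values standing in for the network's last-layer outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_with_default(logits, class_names=("cat", "dog")):
    """Per-class sigmoid activations; the 'other' class scores
    1 minus the maximum activation over the real classes."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    scores = dict(zip(class_names, probs))
    scores["other"] = 1.0 - probs.max()
    return max(scores, key=scores.get)

print(predict_with_default([2.0, -1.0]))   # cat activation ~0.88 wins
print(predict_with_default([-2.0, -1.5]))  # no class above 0.5 -> "other"
```

Note that no "other" images are needed at training time with this scheme; the default class exists only at inference, derived from the real classes' activations.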
You should just add images to your dataset that are neither dogs nor cats, label them as "Other", and treat "Other" as a normal class everywhere in your code. In particular, you'll get a softmax over 3 classes.
The images you use can be anything (except cats and dogs, of course), but they should be of the same kind as the ones you'll likely be testing on when using your network. For instance, if you know you'll be testing on images of dogs, cats, and other animals, train with other animals, not with pictures of flowers. If you don't know what you'll be testing with, try to get very varied images from different sources, etc., so that the network learns well that this class is "anything but cats and dogs" (the wide range of real-world images that fall into this category should be reflected in your training dataset).