Training keras with tensorflow: Redundancy in labelling the object or multiple labels on same object

I was training Keras with TensorFlow for person detection. When I tested the trained model, many images contained redundant labels, i.e. a single person in an image was labelled as a person several times. What is the actual reason behind this?
My training set contains nearly 2000 images, a single class person, batch=32, epoch=100, threshold=0.55 and testing images=250.

Overtraining can lead to redundant detections. Also, if your training samples show people from many different angles, the detector may make errors in real cases. If neither of these is the issue, non-maximum suppression is the better option: among heavily overlapping detections of the same object it keeps only the highest-scoring box.
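As a rough illustration of the non-maximum suppression step mentioned above, here is a minimal pure-Python sketch. The function names, the (x1, y1, x2, y2) box format, the IoU threshold of 0.5 and the example boxes are assumptions made for illustration; only the score threshold of 0.55 comes from the question. Detection frameworks usually ship their own, faster implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.55):
    """Return the indices of the boxes to keep, highest score first."""
    # Drop low-confidence detections, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Made-up example: three detections of one person plus one of a second person.
boxes = [(10, 10, 110, 210), (12, 8, 112, 205), (15, 15, 105, 215), (300, 20, 380, 220)]
scores = [0.92, 0.85, 0.60, 0.75]
kept = non_max_suppression(boxes, scores)
print(kept)  # → [0, 3]
```

Lowering `iou_threshold` suppresses more aggressively, which removes more duplicates but may also merge two people standing close together.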


How does custom object detection actually work?

I am currently testing out custom object detection using the Tensorflow API. But I don't quite seem to understand the theory behind it.
So if I, for example, download a version of MobileNet and use it to train on, let's say, red and green apples, does it forget all the things that it has already been trained on? And if so, why is it then beneficial to use MobileNet over building a CNN from scratch?
Thanks for any answers!
Does it forget all the things that it has already been trained on?
Yes. If you re-train a CNN that was previously trained on a large database with a new database containing fewer classes, it will "forget" the old classes. However, the pre-training can help it learn the new classes. This training strategy is called "transfer learning" or "fine-tuning", depending on the exact approach.
As a rule of thumb it is generally not a good idea to create a new network architecture from scratch as better networks probably already exist. You may want to implement your custom architecture if:
You are learning CNNs and deep learning
You have a specific need and you proved that other architectures won't fit or will perform poorly
Usually, one takes an existing pre-trained network and specializes it for the specific task using transfer learning.
A lot of scientific literature is available for free online if you want to learn. You can start with the YOLO series and R-CNN, Fast R-CNN and Faster R-CNN for detection networks.
The main concept behind object detection is that the input image is divided into a grid of N patches, and for each patch a set of sub-patches with different aspect ratios is generated, let's say M rectangular sub-patches per patch. In total you need to classify M×N sub-patches.
In general, the idea is then to analyze each sub-patch within each patch. You pass the sub-patch to the classifier in your model and, depending on how the model was trained, it will classify it as containing a green apple, a red apple or nothing. If it is classified as a red apple, then this sub-patch is the bounding box of the detected object.
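The grid-and-sub-patch idea above can be sketched as follows. This is only an illustration under assumed values: the function name, the 416×416 image size, the 13×13 grid and the three aspect ratios are hypothetical, and real detectors typically also vary the scale of the boxes and regress offsets from them.

```python
def generate_sub_patches(image_w, image_h, grid_cols, grid_rows, aspect_ratios):
    """Return candidate boxes (x1, y1, x2, y2): one per aspect ratio per grid cell."""
    cell_w = image_w / grid_cols
    cell_h = image_h / grid_rows
    base_area = cell_w * cell_h  # every sub-patch has the area of one grid cell
    boxes = []
    for row in range(grid_rows):
        for col in range(grid_cols):
            # Centre of the current grid cell (patch).
            cx = (col + 0.5) * cell_w
            cy = (row + 0.5) * cell_h
            for ratio in aspect_ratios:  # ratio = width / height
                w = (base_area * ratio) ** 0.5
                h = (base_area / ratio) ** 0.5
                # Clip the box so it stays inside the image.
                boxes.append((
                    max(0.0, cx - w / 2),
                    max(0.0, cy - h / 2),
                    min(image_w, cx + w / 2),
                    min(image_h, cy + h / 2),
                ))
    return boxes


# N = 13 * 13 = 169 patches, M = 3 sub-patches per patch.
boxes = generate_sub_patches(416, 416, grid_cols=13, grid_rows=13,
                             aspect_ratios=[0.5, 1.0, 2.0])
print(len(boxes))  # → 507, i.e. M x N candidates to classify
```

Each of these candidate boxes would then be handed to the classifier, which is why the number of sub-patches directly affects how slow the model is.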
So actually, there are two parts you are interested in:
Generating as many sub-patches as possible to cover as many portions of the image as possible (Of course, the more sub-patches, the slower your model will be) and,
The classifier. The classifier is normally an already existing network (MobileNet, VGG, ResNet...). This part is commonly referred to as the "backbone", and it extracts the features of the input image. You can either train the classifier "from zero", so that its weights are adjusted to your specific problem, OR load the weights learned on another known problem and use them in yours, so you don't need to spend time training them. In the latter case, the classifier will also recognize the objects it was originally trained on.
Take a look at the Mask R-CNN implementation. I find the way they explain the process very interesting. In this architecture, you not only generate a bounding box but also segment the object of interest.

What neural network model is the most effective?

At the moment I am studying neural networks. I tried different models to recognize people and came across a question that I find very interesting. I used YOLO v3 and Mask R-CNN, but both of them missed people in photos taken from an indirect angle. Which of the existing models is the most accurate and effective?
This is the main problem with deep learning models. For every instance of an object you want to detect, there should be at least one object similar to it (in terms of angle, size, color, shape, etc.) in the training set. The more similar objects there are in the training data, the higher the probability that the object will be detected.
In terms of speed and accuracy, YOLO v3 is currently one of the best. Mask R-CNN is also one of the best models if you want the exact boundaries of the object (segmentation). If you don't need the exact boundaries, I would recommend YOLO for its efficiency. You can work on your training data and add more instances of people with different sizes, angles and shapes, and also include cases of truncation and occlusion (when only part of a person is visible), so that the model generalizes better.

Using Tensorflow Object Detection API: RPN losses keep increasing. Are there ways to make RPN losses decrease?

I am using Tensorflow Object Detection API for fine-tuning, using my own data. The goal is to detect 2 classes of objects. I am using the pre-trained faster_rcnn_resnet101_coco model.
The various detection box precision and recall measures are generally increasing (see screenshots below) and are fairly high:
The box classifier losses are decreasing. HOWEVER, the RPN losses are increasing (see screenshots below). It looks like the model is having a hard time distinguishing foreground from background (hence the increasing RPN losses), but once it is able to identify and locate the right foreground, it classifies well (hence the decreasing box classifier losses)? I think this can be observed in the model's performance on test images: the false positive rate (on images that do not contain any of the two classes of target objects) is rather high. On the other hand, on images that do contain those target objects, the model does a fantastic job of accurately identifying and locating them.
So my question is essentially: what are some of the things I could try to make sure the RPN losses also decrease?

Training different objects using tensorflow Object detection API

I recently came across this link for learning tensorflow object detection
However, I have a few doubts and would like suggestions on how to proceed.
1) How should I train different objects using the same model? (I mean, what should my data set contain if I want to train cats and dogs as objects?)
2) Once I have trained it for dogs and then continue training on cars, will the model still detect dogs?
Your dataset should contain a large variety of examples for every object (class) you wish to detect. It sounds like you're misunderstanding the training process by assuming that you train on each class of objects in sequence. This is incorrect. When you train the model, you take a random batch of samples (maybe 64, for example) across all classes.
Training simultaneously on all or many of the classes makes sense, because you have one model that has to perform equally well on all classes. So when you train the model, you compute the error for a random selection of samples from different classes and average it to produce each update step, yielding a model that performs well across classes.
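To make the batching idea concrete, here is a minimal sketch of drawing random mini-batches from a dataset that mixes all classes. The file names, class counts, seed and the `make_batches` helper are made up for illustration; a real training loop would compute the loss on each batch, average it over the batch, and apply one update step per batch.

```python
import random


def make_batches(samples, batch_size, seed=0):
    """Shuffle the whole dataset once and cut it into mini-batches."""
    rng = random.Random(seed)
    shuffled = samples[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# Hypothetical dataset: (image file, label) pairs for three classes.
samples = (
    [(f"cat_{i}.jpg", "cat") for i in range(100)]
    + [(f"dog_{i}.jpg", "dog") for i in range(100)]
    + [(f"car_{i}.jpg", "car") for i in range(100)]
)

batches = make_batches(samples, batch_size=64)

# Count how many samples of each class ended up in the first batch.
first = batches[0]
counts = {label: sum(1 for _, lab in first if lab == label)
          for label in ("cat", "dog", "car")}
print(len(batches), counts)
```

Because the data is shuffled across classes before batching, each update step sees a mix of cats, dogs and cars rather than one class after another.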
Note that it's quite common to run into class imbalance issues. If you have only a few samples of cats and millions of samples of dogs, errors on dogs dominate the loss, so the network is penalized far more for misclassifying dogs as cats than the reverse, and it will simply always predict dog to hedge its bet. Ideally, you will have a roughly equal balance of data per class; if not, there are books and tutorials galore on strategies to deal with this.
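One commonly used strategy, which the answer above does not name explicitly, is to weight each class inversely to its frequency so that mistakes on rare classes count more in the loss. The sketch below assumes that approach; the function name and the 900/100 split are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up imbalanced dataset: many dogs, few cats.
labels = ["dog"] * 900 + ["cat"] * 100
weights = inverse_frequency_weights(labels)
print(weights)  # cat gets weight 5.0, dog roughly 0.56
```

In Keras, a dictionary like this (keyed by integer class index rather than by name) can be passed to `model.fit` through its `class_weight` argument. Oversampling the rare class or collecting more data are alternatives.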

Can I find the region of the found categories in TensorFlow?

We have been using TensorFlow for image classification. We have all seen the results for the Admiral Grace Hopper image, where we get:
military uniform (866): 0.647296
suit (794): 0.0477196
academic gown (896): 0.0232411
bow tie (817): 0.0157356
bolo tie (940): 0.0145024
I was wondering if there is any way to get the coordinates for each category within the image.
TensorFlow doesn't have sample code yet for object detection and localization, and it's an open research problem with several different deep-net approaches. For example, you can look up the papers on the algorithms called OverFeat and YOLO (You Only Look Once).
Also, there is usually some preprocessing of the object coordinate labels, or postprocessing to suppress duplicate detections. Often a second, different network is used to classify the object once it has been detected.