Yolo v3 object detected results are different - yolo

I want to know when do I have to sell or buy the stock.
Based on https://github.com/gunaytemur/With-Yolov3-Buy-Sell-on-Candlestick
and https://pjreddie.com/darknet,
I try to detect candlestock pattern and decide when should I buy or sell the stock.
--What I tried-
I used cfg file in first link, trained the model with image trainset in first link, and follow the tutorial in pjreddie's website.
Epoch: 50000. training loss: 0.83
image size: 1800 x 650
train set # image: 440 of images
--Problem--
These three images are same image, but the detection results are different. Sometimes nothing is detected, or buy/sell patterns are detected but probability, class(sell or buy), object box position is different.
I'm thankful if anyone know reason why detection result is different for same image.

Related

Yolo Training: multiple objects in one image

I have a set of training images that contain many small objects (10-20). The image resolution is high (9000x6000).
Is it better to split the image into the specific objects before running yolo training? Or just leave it as is.
Does yolo resize an entire image, or does it ‘extract’ the annotated object first before resizing?
If it is the former, I am concerned that the resolution will be bad. Imagine 20 objects in a 416x416 image.
Does yolo resize an entire image, or does it ‘extract’ the annotated
object first before resizing?
Yes, an entire image will be resized in case of Yolo and it does not extract annotated object before resizing.
Since your input images have very high resolution, what you can do is:
Yolo can handle object sizes of 25 x 25 effectively with network input layer size 608 x 608. So if your object sizes in original input image are greater than 250 x 250 you can train the images as they are (with 608 x 608 network size). In that case even when images are resized to network size, objects will be of size greater than 25x25. This should give you good accuracy.
(6000/600) * 25 = 250
If object sizes in original images are smaller than 200 x 200, split your input image into 8 smaller units/blocks, say blocks/tiles of 2250 x 1500. Train these blocks as individual images. Each bigger image (9000 x 6000) corresponds to 8 training images. Each image might contain zero to many objects. You can operate in sliding window method.
The method you choose for training should be used for inference as well.
For training on objects of all sizes use following models: [Use this if you use original image as it is used for training]
Yolov4-custom
Yolov3-SPP
Yolov3_5l
If all of the objects that you want to detect are of smaller size, then for effective detection use Yolov4 with following changes: [Use this if you split original image into 8 blocks]
Set layers = 23 instead of layers = 54
Set stride=4 instead of stride=2
Set stride=4 instead of stride=2
References:
Refer this relevant GitHub thread
darknet documentation

tensorflowjs object detection coco ssd config file max_total_detections explanation

can someone quickly explain or confirm my guess on what i.e.:
max_detections_per_class: 100
max_total_detections: 100
in my case ssdlite_mobilenet_v2_coco.config
at line 134 and 135.
From the raw output of predictions, my guess is that the predictions always "tries" to detect 100 objects in the image, despite the actual number of objects in the image. Let's say there is only one cat, there still will be 100 objects detected in my returning raw prediction data. If the model is trained right, of course there should be only one prediction with a high score.
Is that correct?
Thank you!
Yes you are correct.
It will try to detect 100 objects.
Those 100 detections will be then classified, where only one should be correctly identified as a cat.
But it also depends on your Non Max Supression config. If NMS has a low score threshold and a high IoU, it can show multiple detections for the cat (overlapped detections I mean).
You can mess with those values, but from the published papers, the number max detections per image should always be more ~3x more than the actual objects in the image.
In my experience I obtain better results leaving it as is, even if in my data I has less objects in the image (eg 1 per image).

Tensorflow Object Detection Unusually large bounding boxes and wrong results

I am building an object detector in TensorFlow to detect, motorbike riders with and without helmet, I have 1000 Images each for riders with helmet, withouthelmet and pedestrians(pu together -- 3000 IMAGES), My last checkpoint was 35267 steps, I have tested using a traffic video, but I see unusally large bounding boxes with wrong results. Can someone please explain the reason for such detections? Do I need to wait for atleast 50000 steps?? or Do I need to add datasets(Images in the angle to Traffic Cameras)?
Model - SSD Mobilenet COCO - Custom Object Detection,
Training Platform - Google Colab
Please find the Images attachedVideo Snapshot 1
Video Snapshot 2
Day 2 - 10/30/2018
I have tested with Images today, I have got different results, seems to be correct,2nd Day if I test with single object in a Image. Please find the results
Single Object IMage Test 1
Single Object Image Test 2
Tested CHeckpoint - 52,000 Steps
But, If I test with the Images with multiple objects in a road, the detection is wrong and bounding boxes are weirdly bigger, Is it because of the dataset, as I am training with One Motorbike rider(with or with out helmet) per image.
Please find the wrong results
Multi Object Image Test
Multi Object Image Test
I had also tested with images like all Motorbikes in the scene, In this case, I did not get any results, Please find the Images
No Result Image
No Result Image
The results are very confusing, Is there anything I am missing?,
There is no need to wait till 50000 epocs you should get decent result in 35k or even in 10k. I would suggest
go through you data-set again and check all the bounding boxes (data cleaning)
Check your model with inference code for changes like batch normalization etc
Add some more data with different features, angles and color complexities
I would check these points before going further.

How does Nvidia Digits batch size and data shuffling work?

I am trying to train a neural network to detect steganographic images using Tensorflow and Nvidia Digits. I loaded a data set which has two sub directories - Cover Images and Steg Images. I think the network has to process the cover/stegano image pairs together to learn which are the covers and which are steganographic images. Am I correct?
How does batch size work? If I give 1 does it take one image from both sub directories and process them? or do I have to input batch number as 2 for that?
How does shuffling data on each epoch work? does it shuffle both sub directories equally? as an example will 1.jpg be the third photo on both folders or will it be different on them both?
I think the network has to process the cover/stegano image pairs
together to learn which are the covers and which are steganographic
images. Am I correct?
I am not familiar with object detection (right?) in Nvidia Digits, so please check out their tutorials for more information.
You need to think about the kind of labeling the training data first. Usually in the examples I see only use one training folder and one validation folder (each: images and labels) - Digits divides your dataset, e.g. into 90 % training and 10 % validation images.
How does batch size work? If I give 1 does it take one image from both
sub directories and process them? or do I have to input batch number
as 2 for that?
With batch number you tell Digits how many images you use per iteration. It's used for dataset division (memory for calculations is limited; you can't fit the whole dataset into one iteration). In one epoch the whole dataset is processed.
As written above, one image at a time, as far as I know.
How does shuffling data on each epoch work? does it shuffle both sub
directories equally? as an example will 1.jpg be the third photo on
both folders or will it be different on them both?
The data should be shuffled automatically.

TensorFlow Object Detection API: evaluation mAP behaves weirdly?

I am training an object detector for my own data using Tensorflow Object Detection API. I am following the (great) tutorial by Dat Tran https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9. I am using the provided ssd_mobilenet_v1_coco-model pre-trained model checkpoint as the starting point for the training. I have only one object class.
I exported the trained model, ran it on the evaluation data and looked at the resulted bounding boxes. The trained model worked nicely; I would say that if there was 20 objects, typically there were 13 objects with spot on predicted bounding boxes ("true positives"); 7 where the objects were not detected ("false negatives"); 2 cases where problems occur were two or more objects are close to each other: the bounding boxes get drawn between the objects in some of these cases ("false positives"<-of course, calling these "false positives" etc. is inaccurate, but this is just for me to understand the concept of precision here). There are almost no other "false positives". This seems much better result than what I was hoping to get, and while this kind of visual inspection does not give the actual mAP (which is calculated based on overlap of the predicted and tagged bounding boxes?), I would roughly estimate the mAP as something like 13/(13+2) >80%.
However, when I run the evaluation (eval.py) (on two different evaluation sets), I get the following mAP graph (0.7 smoothed):
mAP during training
This would indicate a huge variation in mAP, and level of about 0.3 at the end of the training, which is way worse than what I would assume based on how well the boundary boxes are drawn when I use the exported output_inference_graph.pb on the evaluation set.
Here is the total loss graph for the training:
total loss during training
My training data consist of 200 images with about 20 labeled objects each (I labeled them using the labelImg app); the images are extracted from a video and the objects are small and kind of blurry. The original image size is 1200x900, so I reduced it to 600x450 for the training data. Evaluation data (which I used both as the evaluation data set for eval.pyand to visually check what the predictions look like) is similar, consists of 50 images with 20 object each, but is still in the original size (the training data is extracted from the first 30 min of the video and evaluation data from the last 30 min).
Question 1: Why is the mAP so low in evaluation when the model appears to work so well? Is it normal for the mAP graph fluctuate so much? I did not touch the default values for how many images the tensorboard uses to draw the graph (I read this question: Tensorflow object detection api validation data size and have some vague idea that there is some default value that can be changed?)
Question 2: Can this be related to different size of the training data and the evaluation data (1200x700 vs 600x450)? If so, should I resize the evaluation data, too? (I did not want to do this as my application uses the original image size, and I want to evaluate how well the model does on that data).
Question 3: Is it a problem to form the training and evaluation data from images where there are multiple tagged objects per image (i.e. surely the evaluation routine compares all the predicted bounding boxes in one image to all the tagged bounding boxes in one image, and not all the predicted boxes in one image to one tagged box which would preduce many "false false positives"?)
(Question 4: it seems to me the model training could have been stopped after around 10000 timesteps were the mAP kind of leveled out, is it now overtrained? it's kind of hard to tell when it fluctuates so much.)
I am a newbie with object detection so I very much appreciate any insight anyone can offer! :)
Question 1: This is the tough one... First, I think you don't understand correctly what mAP is, since your rough calculation is false. Here is, briefly, how it is computed:
For each class of object, using the overlap between the real objects and the detected ones, the detections are tagged as "True positive" or "False positive"; all the real objects with no "True positive" associated to them are labelled "False Negative".
Then, iterate through all your detections (on all images of the dataset) in decreasing order of confidence. Compute the accuracy (TP/(TP+FP)) and recall (TP/(TP+FN)), only counting the detections that you've already seen ( with confidence bigger than the current one) for TP and FP. This gives you a point (acc, recc), that you can put on a precision-recall graph.
Once you've added all possible points to your graph, you compute the area under the curve: this is the Average Precision for this category
if you have multiple categories, the mAP is the standard mean of all APs.
Applying that to your case: in the best case your true positive are the detections with the best confidence. In that case your acc/rec curve will look like a rectangle: you'd have 100% accuracy up to (13/20) recall, and then points with 13/20 recall and <100% accuracy; this gives you mAP=AP(category 1)=13/20=0.65. And this is the best case, you can expect less in practice due to false positives which higher confidence.
Other reasons why yours could be lower:
maybe among the bounding boxes that appear to be good, some are still rejected in the calculations because the overlap between the detection and the real object is not quite big enough. The criterion is that Intersection over Union (IoU) of the two bounding boxes (real one and detection) should be over 0.5. While it seems like a gentle threshold, it's not really; you should probably try and write a script to display the detected bounding boxes with a different color depending on whether they're accepted or not (if not, you'll get both a FP and a FN).
maybe you're only visualizing the first 10 images of the evaluation. If so, change that, for 2 reasons: 1. maybe you're just very lucky on these images, and they're not representative of what follows, just by luck. 2. Actually, more than luck, if these images are the first from the evaluation set, they come right after the end of the training set in your video, so they are probably quite similar to some images in the training set, so they are easier to predict, so they're not representative of your evaluation set.
Question 2: if you have not changed that part in the config file mobilenet_v1_coco-model, all your images (both for training and testing) are rescaled to 300x300 pixels at the start of the network, so your preprocessings don't matter.
Question 3: no it's not a problem at all, all these algorithms were designed to detect multiple objects in images.
Question 4: Given the fluctuations, I'd actually keep training it until you can see improvement or clear overtraining. 10k steps is actually quite small, maybe it's enough because your task is relatively easy, maybe it's not enough and you need to wait ten times that to have significant improvement...